We hypothesize that postnatal environmental exposures, particularly those from household products and dietary intake, along with specific serum metabolomics profiles, are significantly associated with BMI Z-score in children aged 6-11 years. Higher serum concentrations of certain metabolites, reflecting exposure to particular chemical classes or metals, are expected to correlate with variation in BMI Z-score after controlling for age and other relevant covariates. Some metabolites associated with chemical exposures and dietary patterns may serve as biomarkers of obesity risk.
Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, interfere with the hormonal signaling that regulates metabolism. This hormonal interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.
A longitudinal study on Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development. These chemicals can mimic or disrupt hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.
The study by Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.
A 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to mixtures of chemicals rather than to individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can be additive (the effects simply sum) or synergistic (the combined effect exceeds the sum of the separate effects), and that such interactions can significantly amplify the risk of obesity and metabolic disorders in children. The dose-response analysis found that even low-level exposure to multiple EDCs could produce significant health impacts through these combined effects.
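The additive-versus-synergistic distinction can be illustrated with a toy linear model on simulated data (entirely hypothetical values, not HELIX data): a significantly positive interaction term indicates a synergistic effect beyond the sum of the two main effects.

```r
# Hypothetical illustration of additive vs. synergistic EDC effects on zBMI.
# Simulated data only -- not drawn from the HELIX cohort.
set.seed(42)
n <- 500
edc1 <- rnorm(n)   # e.g. a log2-transformed phthalate metabolite
edc2 <- rnorm(n)   # e.g. a log2-transformed phenol

# Outcome with two main effects plus a positive interaction (synergy)
zbmi <- 0.2 * edc1 + 0.15 * edc2 + 0.1 * edc1 * edc2 + rnorm(n)

fit <- lm(zbmi ~ edc1 * edc2)
summary(fit)$coefficients
# A significant edc1:edc2 coefficient suggests the combined effect exceeds
# the sum of the individual effects (synergy); a near-zero interaction is
# consistent with simple additivity.
```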
These studies collectively illustrate the critical role of environmental EDCs in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.
This study uses data from the HELIX subcohort of 1,301 mother-child pairs, restricted to children aged 6-11 years for whom complete exposure and outcome data were available. Exposure data include detailed postnatal dietary records and concentrations of chemicals such as BPA and PCBs in child blood samples. The variables are both categorical and numerical, spanning demographic details and biochemical measurements. This dataset allows robust statistical analysis of associations between EDC exposure and BMI Z-scores while accounting for confounders such as age, sex, and socioeconomic status. There are no missing data, so no imputation was needed. Child BMI Z-scores were calculated based on WHO growth standards.
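The loaded .RData file supplies separate data frames (the code below uses `exposome` and `covariates`; a `phenotype` data frame holding `hs_zbmi_who` is assumed here). A minimal sketch of assembling one analysis dataset keyed on the child `ID`:

```r
library(dplyr)

# Sketch: merge exposures, covariates, and the zBMI outcome on the child ID.
# Assumes exposome, covariates, and phenotype all share an "ID" column.
analysis_data <- exposome %>%
  inner_join(covariates, by = "ID") %>%
  inner_join(dplyr::select(phenotype, ID, hs_zbmi_who), by = "ID")

# The dataset is reported complete, so the outcome should have no NAs
stopifnot(!anyNA(analysis_data$hs_zbmi_who))
```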
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" &
           variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

# specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" &
           variable_name %in% c("hs_zbmi_who"))

# combine all selected variables
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)

kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
| | variable_name | domain | family | subfamily | period | location | period_postnatal | description | var_type | transformation | labels | labelsshort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| h_bfdur_Ter | h_bfdur_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Breastfeeding duration (weeks) | factor | Tertiles | Breastfeeding | Breastfeeding |
| hs_bakery_prod_Ter | hs_bakery_prod_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bakery products (hs_cookies + hs_pastries) | factor | Tertiles | Bakery prod | BakeProd |
| hs_beverages_Ter | hs_beverages_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: beverages (hs_dietsoda+hs_soda) | factor | Tertiles | Soda | Soda |
| hs_break_cer_Ter | hs_break_cer_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: breakfast cereal (hs_sugarcer+hs_othcer) | factor | Tertiles | BF cereals | BFcereals |
| hs_caff_drink_Ter | hs_caff_drink_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Drinks a caffeinated or 'energy' drink (e.g. coca-cola, diet-coke, redbull) | factor | Tertiles | Caffeine | Caffeine |
| hs_dairy_Ter | hs_dairy_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) | factor | Tertiles | Dairy | Dairy |
| hs_fastfood_Ter | hs_fastfood_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Visits a fast food restaurant/take away | factor | Tertiles | Fastfood | Fastfood |
| hs_KIDMED_None | hs_KIDMED_None | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Sum of KIDMED indices, without index9 | numeric | None | KIDMED | KIDMED |
| hs_mvpa_prd_alt_None | hs_mvpa_prd_alt_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) | numeric | None | PA | PA |
| hs_org_food_Ter | hs_org_food_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats organic food | factor | Tertiles | Organicfood | Organicfood |
| hs_proc_meat_Ter | hs_proc_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: processed meat (hs_coldmeat+hs_ham) | factor | Tertiles | Processed meat | ProcMeat |
| hs_readymade_Ter | hs_readymade_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats a 'ready-made' supermarket meal | factor | Tertiles | Ready made food | ReadyFood |
| hs_sd_wk_None | hs_sd_wk_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | sedentary behaviour (min/day) | numeric | None | Sedentary | Sedentary |
| hs_total_bread_Ter | hs_total_bread_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bread (hs_darkbread+hs_whbread) | factor | Tertiles | Bread | Bread |
| hs_total_cereal_Ter | hs_total_cereal_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) | factor | Tertiles | Cereals | Cereals |
| hs_total_fish_Ter | hs_total_fish_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) | factor | Tertiles | Fish | Fish |
| hs_total_fruits_Ter | hs_total_fruits_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) | factor | Tertiles | Fruits | Fruits |
| hs_total_lipids_Ter | hs_total_lipids_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: Added fat | factor | Tertiles | Diet fat | Diet fat |
| hs_total_meat_Ter | hs_total_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) | factor | Tertiles | Meat | Meat |
| hs_total_potatoes_Ter | hs_total_potatoes_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: potatoes (hs_frenchfries+hs_potatoes) | factor | Tertiles | Potatoes | Potatoes |
| hs_total_sweets_Ter | hs_total_sweets_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: sweets (hs_choco + hs_sweets + hs_sugar) | factor | Tertiles | Sweets | Sweets |
| hs_total_veg_Ter | hs_total_veg_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: vegetables (hs_cookveg+hs_rawveg) | factor | Tertiles | Vegetables | Vegetables |
| hs_total_yog_Ter | hs_total_yog_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: yogurt (hs_yogurt+hs_probiotic) | factor | Tertiles | Yogurt | Yogurt |
| hs_dif_hours_total_None | hs_dif_hours_total_None | Lifestyles | Lifestyle | Sleep | Postnatal | NA | NA | Total hours of sleep (mean weekdays and night) | numeric | None | Sleep | Sleep |
| hs_as_c_Log2 | hs_as_c_Log2 | Chemicals | Metals | As | Postnatal | NA | NA | Arsenic (As) in child | numeric | Logarithm base 2 | As | As |
| hs_cd_c_Log2 | hs_cd_c_Log2 | Chemicals | Metals | Cd | Postnatal | NA | NA | Cadmium (Cd) in child | numeric | Logarithm base 2 | Cd | Cd |
| hs_co_c_Log2 | hs_co_c_Log2 | Chemicals | Metals | Co | Postnatal | NA | NA | Cobalt (Co) in child | numeric | Logarithm base 2 | Co | Co |
| hs_cs_c_Log2 | hs_cs_c_Log2 | Chemicals | Metals | Cs | Postnatal | NA | NA | Caesium (Cs) in child | numeric | Logarithm base 2 | Cs | Cs |
| hs_cu_c_Log2 | hs_cu_c_Log2 | Chemicals | Metals | Cu | Postnatal | NA | NA | Copper (Cu) in child | numeric | Logarithm base 2 | Cu | Cu |
| hs_hg_c_Log2 | hs_hg_c_Log2 | Chemicals | Metals | Hg | Postnatal | NA | NA | Mercury (Hg) in child | numeric | Logarithm base 2 | Hg | Hg |
| hs_mn_c_Log2 | hs_mn_c_Log2 | Chemicals | Metals | Mn | Postnatal | NA | NA | Manganese (Mn) in child | numeric | Logarithm base 2 | Mn | Mn |
| hs_mo_c_Log2 | hs_mo_c_Log2 | Chemicals | Metals | Mo | Postnatal | NA | NA | Molybdenum (Mo) in child | numeric | Logarithm base 2 | Mo | Mo |
| hs_pb_c_Log2 | hs_pb_c_Log2 | Chemicals | Metals | Pb | Postnatal | NA | NA | Lead (Pb) in child | numeric | Logarithm base 2 | Pb | Pb |
| hs_tl_cdich_None | hs_tl_cdich_None | Chemicals | Metals | Tl | Postnatal | NA | NA | Dichotomous variable of thallium (Tl) in child | factor | None | Tl | Tl |
| hs_dde_cadj_Log2 | hs_dde_cadj_Log2 | Chemicals | Organochlorines | DDE | Postnatal | NA | NA | Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids | numeric | Logarithm base 2 | DDE | DDE |
| hs_ddt_cadj_Log2 | hs_ddt_cadj_Log2 | Chemicals | Organochlorines | DDT | Postnatal | NA | NA | Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids | numeric | Logarithm base 2 | DDT | DDT |
| hs_hcb_cadj_Log2 | hs_hcb_cadj_Log2 | Chemicals | Organochlorines | HCB | Postnatal | NA | NA | Hexachlorobenzene (HCB) in child adjusted for lipids | numeric | Logarithm base 2 | HCB | HCB |
| hs_pcb118_cadj_Log2 | hs_pcb118_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 118 | PCB118 |
| hs_pcb138_cadj_Log2 | hs_pcb138_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 138 | PCB138 |
| hs_pcb153_cadj_Log2 | hs_pcb153_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 153 | PCB153 |
| hs_pcb170_cadj_Log2 | hs_pcb170_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 170 | PCB170 |
| hs_pcb180_cadj_Log2 | hs_pcb180_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 180 | PCB180 |
| hs_sumPCBs5_cadj_Log2 | hs_sumPCBs5_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Sum of PCBs in child adjusted for lipids (4 cohorts) | numeric | Logarithm base 2 | PCBs | SumPCB |
| hs_dep_cadj_Log2 | hs_dep_cadj_Log2 | Chemicals | Organophosphate pesticides | DEP | Postnatal | NA | NA | Diethyl phosphate (DEP) in child adjusted for creatinine | numeric | Logarithm base 2 | DEP | DEP |
| hs_detp_cadj_Log2 | hs_detp_cadj_Log2 | Chemicals | Organophosphate pesticides | DETP | Postnatal | NA | NA | Diethyl thiophosphate (DETP) in child adjusted for creatinine | numeric | Logarithm base 2 | DETP | DETP |
| hs_dmdtp_cdich_None | hs_dmdtp_cdich_None | Chemicals | Organophosphate pesticides | DMDTP | Postnatal | NA | NA | Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child | factor | None | DMDTP | DMDTP |
| hs_dmp_cadj_Log2 | hs_dmp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMP | Postnatal | NA | NA | Dimethyl phosphate (DMP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMP | DMP |
| hs_dmtp_cadj_Log2 | hs_dmtp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMTP | Postnatal | NA | NA | Dimethyl thiophosphate (DMTP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMTP | DMTP |
| hs_pbde153_cadj_Log2 | hs_pbde153_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE153 | Postnatal | NA | NA | Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 153 | PBDE153 |
| hs_pbde47_cadj_Log2 | hs_pbde47_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE47 | Postnatal | NA | NA | Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 47 | PBDE47 |
| hs_pfhxs_c_Log2 | hs_pfhxs_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFHXS | Postnatal | NA | NA | Perfluorohexane sulfonate (PFHXS) in child | numeric | Logarithm base 2 | PFHXS | PFHXS |
| hs_pfna_c_Log2 | hs_pfna_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFNA | Postnatal | NA | NA | Perfluorononanoate (PFNA) in child | numeric | Logarithm base 2 | PFNA | PFNA |
| hs_pfoa_c_Log2 | hs_pfoa_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOA | Postnatal | NA | NA | Perfluorooctanoate (PFOA) in child | numeric | Logarithm base 2 | PFOA | PFOA |
| hs_pfos_c_Log2 | hs_pfos_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOS | Postnatal | NA | NA | Perfluorooctane sulfonate (PFOS) in child | numeric | Logarithm base 2 | PFOS | PFOS |
| hs_pfunda_c_Log2 | hs_pfunda_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFUNDA | Postnatal | NA | NA | Perfluoroundecanoate (PFUNDA) in child | numeric | Logarithm base 2 | PFUNDA | PFUNDA |
| hs_bpa_cadj_Log2 | hs_bpa_cadj_Log2 | Chemicals | Phenols | BPA | Postnatal | NA | NA | Bisphenol A (BPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BPA | BPA |
| hs_bupa_cadj_Log2 | hs_bupa_cadj_Log2 | Chemicals | Phenols | BUPA | Postnatal | NA | NA | N-Butyl paraben (BUPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BUPA | BUPA |
| hs_etpa_cadj_Log2 | hs_etpa_cadj_Log2 | Chemicals | Phenols | ETPA | Postnatal | NA | NA | Ethyl paraben (ETPA) in child adjusted for creatinine | numeric | Logarithm base 2 | ETPA | ETPA |
| hs_mepa_cadj_Log2 | hs_mepa_cadj_Log2 | Chemicals | Phenols | MEPA | Postnatal | NA | NA | Methyl paraben (MEPA) in child adjusted for creatinine | numeric | Logarithm base 2 | MEPA | MEPA |
| hs_oxbe_cadj_Log2 | hs_oxbe_cadj_Log2 | Chemicals | Phenols | OXBE | Postnatal | NA | NA | Oxybenzone (OXBE) in child adjusted for creatinine | numeric | Logarithm base 2 | OXBE | OXBE |
| hs_prpa_cadj_Log2 | hs_prpa_cadj_Log2 | Chemicals | Phenols | PRPA | Postnatal | NA | NA | Propyl paraben (PRPA) in child adjusted for creatinine | numeric | Logarithm base 2 | PRPA | PRPA |
| hs_trcs_cadj_Log2 | hs_trcs_cadj_Log2 | Chemicals | Phenols | TRCS | Postnatal | NA | NA | Triclosan (TRCS) in child adjusted for creatinine | numeric | Logarithm base 2 | TRCS | TRCS |
| hs_mbzp_cadj_Log2 | hs_mbzp_cadj_Log2 | Chemicals | Phthalates | MBZP | Postnatal | NA | NA | Mono benzyl phthalate (MBzP) in child adjusted for creatinine | numeric | Logarithm base 2 | MBZP | MBZP |
| hs_mecpp_cadj_Log2 | hs_mecpp_cadj_Log2 | Chemicals | Phthalates | MECPP | Postnatal | NA | NA | Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine | numeric | Logarithm base 2 | MECPP | MECPP |
| hs_mehhp_cadj_Log2 | hs_mehhp_cadj_Log2 | Chemicals | Phthalates | MEHHP | Postnatal | NA | NA | Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHHP | MEHHP |
| hs_mehp_cadj_Log2 | hs_mehp_cadj_Log2 | Chemicals | Phthalates | MEHP | Postnatal | NA | NA | Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHP | MEHP |
| hs_meohp_cadj_Log2 | hs_meohp_cadj_Log2 | Chemicals | Phthalates | MEOHP | Postnatal | NA | NA | Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEOHP | MEOHP |
| hs_mep_cadj_Log2 | hs_mep_cadj_Log2 | Chemicals | Phthalates | MEP | Postnatal | NA | NA | Monoethyl phthalate (MEP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEP | MEP |
| hs_mibp_cadj_Log2 | hs_mibp_cadj_Log2 | Chemicals | Phthalates | MIBP | Postnatal | NA | NA | Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MIBP | MIBP |
| hs_mnbp_cadj_Log2 | hs_mnbp_cadj_Log2 | Chemicals | Phthalates | MNBP | Postnatal | NA | NA | Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MNBP | MNBP |
| hs_ohminp_cadj_Log2 | hs_ohminp_cadj_Log2 | Chemicals | Phthalates | OHMiNP | Postnatal | NA | NA | Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OHMiNP | OHMiNP |
| hs_oxominp_cadj_Log2 | hs_oxominp_cadj_Log2 | Chemicals | Phthalates | OXOMINP | Postnatal | NA | NA | Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OXOMINP | OXOMINP |
| hs_sumDEHP_cadj_Log2 | hs_sumDEHP_cadj_Log2 | Chemicals | Phthalates | DEHP | Postnatal | NA | NA | Sum of DEHP metabolites (µg/g) in child adjusted for creatinine | numeric | Logarithm base 2 | DEHP | SumDEHP |
| FAS_cat_None | FAS_cat_None | Chemicals | Social and economic capital | Economic capital | Postnatal | NA | NA | Family affluence score | factor | None | Family affluence | FamAfl |
| hs_contactfam_3cat_num_None | hs_contactfam_3cat_num_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | social capital: family friends | factor | None | Social contact | SocCont |
| hs_hm_pers_None | hs_hm_pers_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | How many people live in your home? | numeric | None | House crowding | HouseCrow |
| hs_participation_3cat_None | hs_participation_3cat_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | social capital: structural | factor | None | Social participation | SocPartic |
| hs_cotinine_cdich_None | hs_cotinine_cdich_None | Chemicals | Tobacco Smoke | Cotinine | Postnatal | NA | NA | Dichotomous variable of cotinine in child | factor | None | Cotinine | Cotinine |
| hs_globalexp2_None | hs_globalexp2_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Global exposure of the child to ETS (2 categories) | factor | None | ETS | ETS |
| hs_smk_parents_None | hs_smk_parents_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Tobacco Smoke status of parents (both) | factor | None | Smoking_parents | SmokPar |
| e3_sex_None | e3_sex_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Child sex (female / male) | factor | None | Child sex | Sex |
| e3_yearbir_None | e3_yearbir_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Year of birth (2003 to 2009) | factor | None | Year of birth | YearBirth |
| hs_child_age_None | hs_child_age_None | Covariates | Covariates | Child covariate | Postnatal | NA | NA | Child age at examination (years) | numeric | None | Child age | cAge |
| hs_zbmi_who | hs_zbmi_who | Phenotype | Phenotype | Outcome at 6-11 years old | Postnatal | NA | NA | Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age | numeric | None | Body mass index z-score | zBMI |
# specific lifestyle exposures
lifestyle_exposures <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
lifestyle_exposome <- dplyr::select(exposome, all_of(lifestyle_exposures))
summarytools::view(dfSummary(lifestyle_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
[dfSummary output omitted: 12 lifestyle tertile variables, all factors, each with 0 (0.0%) missing. Generated by summarytools 1.0.1 (R version 4.4.0), 2024-07-23.]
categorical_lifestyle <- lifestyle_exposome %>%
dplyr::select(where(is.factor))
categorical_lifestyle_long <- pivot_longer(
categorical_lifestyle,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_lifestyle_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_lifestyle_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
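As an alternative to looping with `lapply`, the same long-format data can be drawn as a single faceted figure, which keeps the bar charts on a common layout:

```r
library(ggplot2)

# One faceted figure instead of twelve separate bar charts
ggplot(categorical_lifestyle_long, aes(x = value, fill = value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_x") +
  labs(x = NULL, y = "Count") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
```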
Breastfeeding Duration: The majority of observations fall in the highest duration category, suggesting that longer breastfeeding periods are common.
Bakery Products: Shows a relatively even distribution across the three categories, indicating varied consumption levels of bakery products among participants.
Breakfast Cereal: The highest category of cereal consumption is the most common, suggesting a preference for or greater consumption of cereals.
Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.
Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.
Organic Food: Most participants either consume a lot of or no organic food, with fewer in the middle range.
Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.
Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.
Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.
Added Fats: More participants consume added fats at the lowest and highest levels, with fewer in the middle.
Sweets: High consumption of sweets is the most common, indicating a preference for or higher access to sugary foods.
Vegetables: Most participants consume a high amount of vegetables.
# specific chemical exposures
chemical_exposures <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
[dfSummary output omitted: 19 log2-transformed numeric chemical exposure variables, with 317 to 1061 distinct values each and 0 (0.0%) missing throughout. Generated by summarytools 1.0.1 (R version 4.4.0), 2024-07-23.]
#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>%
dplyr::select(where(is.numeric))
numeric_chemical_long <- pivot_longer(
numeric_chemical,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_chemical_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_chemical_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Cadmium (hs_cd_c_Log2): The histogram for Cadmium exposure (hs_cd_c_Log2) shows a relatively symmetric distribution centered around 4 on the log2 scale. Most values range from approximately 3 to 5, with few outliers.
Cobalt (hs_co_c_Log2): The histogram of cobalt levels shows a roughly normal distribution with a slight positive skew, peaking around 3.5.
Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution, indicating that most participants have relatively low exposure levels while a small number have substantially higher exposures; the majority of the data are centered around 1.5 to 2.5.
Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.
Mercury (hs_hg_c_Log2): This distribution is also right-skewed, common for environmental pollutants, where a majority have lower exposure levels, and a minority have high exposure levels.
Molybdenum (hs_mo_c_Log2): Shows a sharp peak around 6 with a long right tail, suggesting that most people have similar exposure levels within a narrow range, while a few have exceptionally high exposures.
Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.
DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher exposure levels in a smaller subset of the population.
PCB 153 (hs_pcb153_cadj_Log2): Right-skewed and bimodal, with peaks around 2 and 4, suggesting that higher exposure to this compound is concentrated in a smaller segment of the population.
PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a significant right skew: most samples have lower concentrations and fewer show higher concentrations, suggesting that while most individuals have low exposure, a few may have considerably higher levels.
DEP (hs_dep_cadj_Log2): DEP exposure has a sharp peak around 6, indicating a narrow distribution of values.
PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, with peaks around 1 and 4.
PFHxS: Distribution peaks around 5 with a broad spread, indicating variation in PFHxS levels.
PFOA: Shows a peak around 6 with a symmetrical distribution, indicating consistent PFOA levels.
PFOS: Similar to PFOA, the distribution peaks around 6, indicating consistent PFOS levels.
Propyl Paraben (hs_prpa_cadj_Log2): The distribution peaks around 6 with a broad spread, indicating variability in propyl paraben levels.
Monobenzyl Phthalate (hs_mbzp_cadj_Log2): Right-skewed with a broad peak around 4; most values cluster at the lower end, with a long tail toward higher values suggesting occasional higher exposures.
Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.
Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around 4, indicating consistent exposure levels with some variation. Few outliers are present.
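The skewness described above can be quantified rather than judged by eye. A minimal sketch using a moment-based sample skewness in base R (no extra packages) ranks the right-skewed exposures:

```r
# Moment-based sample skewness; positive values indicate a right skew
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Apply to every numeric chemical exposure column and rank
skew_table <- sort(sapply(numeric_chemical, skewness), decreasing = TRUE)
round(skew_table, 2)  # the largest values flag the most right-skewed exposures
```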
# numeric columns for the correlation matrix (select_if is superseded by where())
numeric_chemical <- dplyr::select(chemical_exposome, where(is.numeric))
cor_matrix <- cor(numeric_chemical, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
Covariates were selected based on their relevance to the postnatal period. The exceptions are sex and year of birth, which are fixed at pregnancy. These covariates were included because BMI Z-scores differ by sex and vary with the child's age and year of birth.
# Specified covariates
specific_covariates <- c(
"e3_sex_None",
"e3_yearbir_None",
"hs_child_age_None"
)
covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | e3_sex_None [factor] | | | 0 (0.0%) |
| 2 | e3_yearbir_None [factor] | | | 0 (0.0%) |
| 3 | hs_child_age_None [numeric] | | 879 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-23
#separate numeric and categorical data
numeric_covariates <- covariate_data %>%
dplyr::select(where(is.numeric))
numeric_covariates_long <- pivot_longer(
numeric_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_covariates_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_covariates_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.
categorical_covariates <- covariate_data %>%
dplyr::select(where(is.factor))
categorical_covariates_long <- pivot_longer(
categorical_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_covariates_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
Cohorts (h_cohort): The distribution shows the count of subjects across six different cohorts. All cohorts have a substantial number of subjects, with cohort 5 showing the highest participation.
Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count for females (693) than males (608).
Year of Birth (e3_yearbir): Most subjects were born in the later years of the recruitment window, peaking in 2008 (379 subjects); only 13 subjects were born in 2009, so that year contributes little to the sample.
outcome_BMI <- phenotype %>%
dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | hs_zbmi_who [numeric] | | 421 distinct values | 0 (0.0%) |
combined_data <- cbind(covariate_data, lifestyle_exposome, chemical_exposome, outcome_BMI)
combined_data <- combined_data[, !duplicated(colnames(combined_data))]
# sex variable to a factor for stratification; map levels explicitly rather
# than relying on alphabetical level order
combined_data$e3_sex_None <- factor(combined_data$e3_sex_None,
                                    levels = c("male", "female"),
                                    labels = c("Male", "Female"))
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
table1_formula <- ~
hs_child_age_None + e3_yearbir_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None
table1(
table1_formula,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
  overall = "Overall",
topclass = "Rtable1-shade"
)
| | Male (N=608) | Female (N=693) | Overall (N=1301) |
|---|---|---|---|
| hs_child_age_None | 7.91 (1.58) | 8.03 (1.64) | 7.98 (1.61) |
| e3_yearbir_None | |||
| 2003 | 25 (4.1 %) | 30 (4.3 %) | 55 (4.2 %) |
| 2004 | 46 (7.6 %) | 61 (8.8 %) | 107 (8.2 %) |
| 2005 | 121 (19.9 %) | 120 (17.3 %) | 241 (18.5 %) |
| 2006 | 108 (17.8 %) | 148 (21.4 %) | 256 (19.7 %) |
| 2007 | 128 (21.1 %) | 122 (17.6 %) | 250 (19.2 %) |
| 2008 | 177 (29.1 %) | 202 (29.1 %) | 379 (29.1 %) |
| 2009 | 3 (0.5 %) | 10 (1.4 %) | 13 (1.0 %) |
| hs_zbmi_who | 0.35 (1.15) | 0.45 (1.22) | 0.40 (1.19) |
| h_bfdur_Ter | |||
| (0,10.8] | 231 (38.0 %) | 275 (39.7 %) | 506 (38.9 %) |
| (10.8,34.9] | 118 (19.4 %) | 152 (21.9 %) | 270 (20.8 %) |
| (34.9,Inf] | 259 (42.6 %) | 266 (38.4 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | |||
| (0,2] | 164 (27.0 %) | 181 (26.1 %) | 345 (26.5 %) |
| (2,6] | 188 (30.9 %) | 235 (33.9 %) | 423 (32.5 %) |
| (6,Inf] | 256 (42.1 %) | 277 (40.0 %) | 533 (41.0 %) |
| hs_break_cer_Ter | |||
| (0,1.1] | 141 (23.2 %) | 150 (21.6 %) | 291 (22.4 %) |
| (1.1,5.5] | 251 (41.3 %) | 270 (39.0 %) | 521 (40.0 %) |
| (5.5,Inf] | 216 (35.5 %) | 273 (39.4 %) | 489 (37.6 %) |
| hs_dairy_Ter | |||
| (0,14.6] | 175 (28.8 %) | 184 (26.6 %) | 359 (27.6 %) |
| (14.6,25.6] | 229 (37.7 %) | 236 (34.1 %) | 465 (35.7 %) |
| (25.6,Inf] | 204 (33.6 %) | 273 (39.4 %) | 477 (36.7 %) |
| hs_fastfood_Ter | |||
| (0,0.132] | 75 (12.3 %) | 68 (9.8 %) | 143 (11.0 %) |
| (0.132,0.5] | 273 (44.9 %) | 330 (47.6 %) | 603 (46.3 %) |
| (0.5,Inf] | 260 (42.8 %) | 295 (42.6 %) | 555 (42.7 %) |
| hs_org_food_Ter | |||
| (0,0.132] | 211 (34.7 %) | 218 (31.5 %) | 429 (33.0 %) |
| (0.132,1] | 191 (31.4 %) | 205 (29.6 %) | 396 (30.4 %) |
| (1,Inf] | 206 (33.9 %) | 270 (39.0 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | |||
| (0,1.5] | 175 (28.8 %) | 191 (27.6 %) | 366 (28.1 %) |
| (1.5,4] | 227 (37.3 %) | 244 (35.2 %) | 471 (36.2 %) |
| (4,Inf] | 206 (33.9 %) | 258 (37.2 %) | 464 (35.7 %) |
| hs_total_fish_Ter | |||
| (0,1.5] | 183 (30.1 %) | 206 (29.7 %) | 389 (29.9 %) |
| (1.5,3] | 224 (36.8 %) | 230 (33.2 %) | 454 (34.9 %) |
| (3,Inf] | 201 (33.1 %) | 257 (37.1 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | |||
| (0,7] | 174 (28.6 %) | 239 (34.5 %) | 413 (31.7 %) |
| (7,14.1] | 216 (35.5 %) | 191 (27.6 %) | 407 (31.3 %) |
| (14.1,Inf] | 218 (35.9 %) | 263 (38.0 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | |||
| (0,3] | 193 (31.7 %) | 204 (29.4 %) | 397 (30.5 %) |
| (3,7] | 171 (28.1 %) | 232 (33.5 %) | 403 (31.0 %) |
| (7,Inf] | 244 (40.1 %) | 257 (37.1 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | |||
| (0,4.1] | 149 (24.5 %) | 195 (28.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 251 (41.3 %) | 265 (38.2 %) | 516 (39.7 %) |
| (8.5,Inf] | 208 (34.2 %) | 233 (33.6 %) | 441 (33.9 %) |
| hs_total_veg_Ter | |||
| (0,6] | 190 (31.2 %) | 214 (30.9 %) | 404 (31.1 %) |
| (6,8.5] | 136 (22.4 %) | 178 (25.7 %) | 314 (24.1 %) |
| (8.5,Inf] | 282 (46.4 %) | 301 (43.4 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -3.99 (0.98) | -3.95 (1.09) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.37 (0.61) | -2.32 (0.64) | -2.34 (0.63) |
| hs_cs_c_Log2 | 0.44 (0.58) | 0.44 (0.57) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.81 (0.25) | 9.84 (0.22) | 9.83 (0.23) |
| hs_hg_c_Log2 | -0.24 (1.59) | -0.35 (1.75) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.32 (0.83) | -0.31 (0.96) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.63 (1.48) | 4.70 (1.50) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.47 (0.86) | 3.63 (0.94) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | -0.60 (3.22) | -0.05 (2.77) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | 0.27 (3.16) | 0.06 (3.25) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.66 (3.86) | -4.40 (3.80) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -1.62 (1.30) | -1.53 (1.31) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.60 (0.55) | 0.62 (0.56) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 0.95 (1.15) | 0.99 (1.08) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -1.26 (3.96) | -1.91 (3.68) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.42 (1.23) | 2.47 (1.22) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.54 (1.09) | 5.39 (1.12) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 4.77 (1.08) | 4.60 (0.96) | 4.68 (1.02) |
combined_data$e3_yearbir_None <- as.factor(combined_data$e3_yearbir_None)
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
table1_formula_year <- ~
hs_child_age_None + e3_sex_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_yearbir_None
table1(
table1_formula_year,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
  overall = "Overall",
topclass = "Rtable1-shade"
)
| | 2003 (N=55) | 2004 (N=107) | 2005 (N=241) | 2006 (N=256) | 2007 (N=250) | 2008 (N=379) | 2009 (N=13) | Overall (N=1301) |
|---|---|---|---|---|---|---|---|---|
| hs_child_age_None | 11.32 (0.47) | 10.81 (0.38) | 9.19 (0.57) | 8.38 (0.41) | 6.91 (0.51) | 6.41 (0.31) | 6.18 (0.15) | 7.98 (1.61) |
| e3_sex_None | ||||||||
| Male | 25 (45.5 %) | 46 (43.0 %) | 121 (50.2 %) | 108 (42.2 %) | 128 (51.2 %) | 177 (46.7 %) | 3 (23.1 %) | 608 (46.7 %) |
| Female | 30 (54.5 %) | 61 (57.0 %) | 120 (49.8 %) | 148 (57.8 %) | 122 (48.8 %) | 202 (53.3 %) | 10 (76.9 %) | 693 (53.3 %) |
| hs_zbmi_who | 0.32 (1.15) | 0.12 (1.17) | 0.45 (1.11) | 0.34 (1.11) | 0.41 (1.22) | 0.49 (1.28) | 0.79 (1.14) | 0.40 (1.19) |
| h_bfdur_Ter | ||||||||
| (0,10.8] | 35 (63.6 %) | 65 (60.7 %) | 92 (38.2 %) | 81 (31.6 %) | 90 (36.0 %) | 138 (36.4 %) | 5 (38.5 %) | 506 (38.9 %) |
| (10.8,34.9] | 15 (27.3 %) | 28 (26.2 %) | 69 (28.6 %) | 42 (16.4 %) | 38 (15.2 %) | 75 (19.8 %) | 3 (23.1 %) | 270 (20.8 %) |
| (34.9,Inf] | 5 (9.1 %) | 14 (13.1 %) | 80 (33.2 %) | 133 (52.0 %) | 122 (48.8 %) | 166 (43.8 %) | 5 (38.5 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | ||||||||
| (0,2] | 16 (29.1 %) | 21 (19.6 %) | 88 (36.5 %) | 117 (45.7 %) | 48 (19.2 %) | 53 (14.0 %) | 2 (15.4 %) | 345 (26.5 %) |
| (2,6] | 12 (21.8 %) | 30 (28.0 %) | 75 (31.1 %) | 90 (35.2 %) | 77 (30.8 %) | 132 (34.8 %) | 7 (53.8 %) | 423 (32.5 %) |
| (6,Inf] | 27 (49.1 %) | 56 (52.3 %) | 78 (32.4 %) | 49 (19.1 %) | 125 (50.0 %) | 194 (51.2 %) | 4 (30.8 %) | 533 (41.0 %) |
| hs_break_cer_Ter | ||||||||
| (0,1.1] | 16 (29.1 %) | 38 (35.5 %) | 60 (24.9 %) | 62 (24.2 %) | 39 (15.6 %) | 70 (18.5 %) | 6 (46.2 %) | 291 (22.4 %) |
| (1.1,5.5] | 19 (34.5 %) | 36 (33.6 %) | 103 (42.7 %) | 103 (40.2 %) | 103 (41.2 %) | 153 (40.4 %) | 4 (30.8 %) | 521 (40.0 %) |
| (5.5,Inf] | 20 (36.4 %) | 33 (30.8 %) | 78 (32.4 %) | 91 (35.5 %) | 108 (43.2 %) | 156 (41.2 %) | 3 (23.1 %) | 489 (37.6 %) |
| hs_dairy_Ter | ||||||||
| (0,14.6] | 15 (27.3 %) | 21 (19.6 %) | 67 (27.8 %) | 58 (22.7 %) | 73 (29.2 %) | 119 (31.4 %) | 6 (46.2 %) | 359 (27.6 %) |
| (14.6,25.6] | 12 (21.8 %) | 27 (25.2 %) | 90 (37.3 %) | 101 (39.5 %) | 80 (32.0 %) | 149 (39.3 %) | 6 (46.2 %) | 465 (35.7 %) |
| (25.6,Inf] | 28 (50.9 %) | 59 (55.1 %) | 84 (34.9 %) | 97 (37.9 %) | 97 (38.8 %) | 111 (29.3 %) | 1 (7.7 %) | 477 (36.7 %) |
| hs_fastfood_Ter | ||||||||
| (0,0.132] | 6 (10.9 %) | 14 (13.1 %) | 20 (8.3 %) | 21 (8.2 %) | 22 (8.8 %) | 57 (15.0 %) | 3 (23.1 %) | 143 (11.0 %) |
| (0.132,0.5] | 38 (69.1 %) | 45 (42.1 %) | 140 (58.1 %) | 153 (59.8 %) | 94 (37.6 %) | 129 (34.0 %) | 4 (30.8 %) | 603 (46.3 %) |
| (0.5,Inf] | 11 (20.0 %) | 48 (44.9 %) | 81 (33.6 %) | 82 (32.0 %) | 134 (53.6 %) | 193 (50.9 %) | 6 (46.2 %) | 555 (42.7 %) |
| hs_org_food_Ter | ||||||||
| (0,0.132] | 17 (30.9 %) | 30 (28.0 %) | 77 (32.0 %) | 51 (19.9 %) | 96 (38.4 %) | 155 (40.9 %) | 3 (23.1 %) | 429 (33.0 %) |
| (0.132,1] | 20 (36.4 %) | 39 (36.4 %) | 78 (32.4 %) | 99 (38.7 %) | 68 (27.2 %) | 87 (23.0 %) | 5 (38.5 %) | 396 (30.4 %) |
| (1,Inf] | 18 (32.7 %) | 38 (35.5 %) | 86 (35.7 %) | 106 (41.4 %) | 86 (34.4 %) | 137 (36.1 %) | 5 (38.5 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | ||||||||
| (0,1.5] | 6 (10.9 %) | 26 (24.3 %) | 38 (15.8 %) | 37 (14.5 %) | 91 (36.4 %) | 162 (42.7 %) | 6 (46.2 %) | 366 (28.1 %) |
| (1.5,4] | 36 (65.5 %) | 41 (38.3 %) | 80 (33.2 %) | 92 (35.9 %) | 83 (33.2 %) | 134 (35.4 %) | 5 (38.5 %) | 471 (36.2 %) |
| (4,Inf] | 13 (23.6 %) | 40 (37.4 %) | 123 (51.0 %) | 127 (49.6 %) | 76 (30.4 %) | 83 (21.9 %) | 2 (15.4 %) | 464 (35.7 %) |
| hs_total_fish_Ter | ||||||||
| (0,1.5] | 11 (20.0 %) | 20 (18.7 %) | 32 (13.3 %) | 36 (14.1 %) | 106 (42.4 %) | 180 (47.5 %) | 4 (30.8 %) | 389 (29.9 %) |
| (1.5,3] | 31 (56.4 %) | 56 (52.3 %) | 80 (33.2 %) | 70 (27.3 %) | 82 (32.8 %) | 128 (33.8 %) | 7 (53.8 %) | 454 (34.9 %) |
| (3,Inf] | 13 (23.6 %) | 31 (29.0 %) | 129 (53.5 %) | 150 (58.6 %) | 62 (24.8 %) | 71 (18.7 %) | 2 (15.4 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | ||||||||
| (0,7] | 33 (60.0 %) | 58 (54.2 %) | 73 (30.3 %) | 55 (21.5 %) | 68 (27.2 %) | 119 (31.4 %) | 7 (53.8 %) | 413 (31.7 %) |
| (7,14.1] | 13 (23.6 %) | 24 (22.4 %) | 88 (36.5 %) | 76 (29.7 %) | 81 (32.4 %) | 123 (32.5 %) | 2 (15.4 %) | 407 (31.3 %) |
| (14.1,Inf] | 9 (16.4 %) | 25 (23.4 %) | 80 (33.2 %) | 125 (48.8 %) | 101 (40.4 %) | 137 (36.1 %) | 4 (30.8 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | ||||||||
| (0,3] | 9 (16.4 %) | 18 (16.8 %) | 97 (40.2 %) | 82 (32.0 %) | 67 (26.8 %) | 122 (32.2 %) | 2 (15.4 %) | 397 (30.5 %) |
| (3,7] | 26 (47.3 %) | 45 (42.1 %) | 71 (29.5 %) | 59 (23.0 %) | 80 (32.0 %) | 116 (30.6 %) | 6 (46.2 %) | 403 (31.0 %) |
| (7,Inf] | 20 (36.4 %) | 44 (41.1 %) | 73 (30.3 %) | 115 (44.9 %) | 103 (41.2 %) | 141 (37.2 %) | 5 (38.5 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | ||||||||
| (0,4.1] | 16 (29.1 %) | 17 (15.9 %) | 87 (36.1 %) | 96 (37.5 %) | 51 (20.4 %) | 74 (19.5 %) | 3 (23.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 13 (23.6 %) | 38 (35.5 %) | 99 (41.1 %) | 103 (40.2 %) | 104 (41.6 %) | 157 (41.4 %) | 2 (15.4 %) | 516 (39.7 %) |
| (8.5,Inf] | 26 (47.3 %) | 52 (48.6 %) | 55 (22.8 %) | 57 (22.3 %) | 95 (38.0 %) | 148 (39.1 %) | 8 (61.5 %) | 441 (33.9 %) |
| hs_total_veg_Ter | ||||||||
| (0,6] | 19 (34.5 %) | 26 (24.3 %) | 75 (31.1 %) | 63 (24.6 %) | 81 (32.4 %) | 136 (35.9 %) | 4 (30.8 %) | 404 (31.1 %) |
| (6,8.5] | 11 (20.0 %) | 25 (23.4 %) | 53 (22.0 %) | 71 (27.7 %) | 55 (22.0 %) | 95 (25.1 %) | 4 (30.8 %) | 314 (24.1 %) |
| (8.5,Inf] | 25 (45.5 %) | 56 (52.3 %) | 113 (46.9 %) | 122 (47.7 %) | 114 (45.6 %) | 148 (39.1 %) | 5 (38.5 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -4.32 (1.45) | -3.93 (1.01) | -3.90 (1.15) | -3.91 (1.00) | -3.90 (0.88) | -4.05 (0.97) | -4.09 (1.85) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.35 (0.51) | -2.38 (0.59) | -2.54 (0.60) | -2.45 (0.68) | -2.27 (0.58) | -2.20 (0.62) | -2.11 (0.55) | -2.34 (0.63) |
| hs_cs_c_Log2 | 1.04 (0.47) | 1.03 (0.50) | 0.71 (0.43) | 0.65 (0.43) | 0.18 (0.50) | 0.07 (0.45) | -0.04 (0.40) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.83 (0.18) | 9.89 (0.30) | 9.79 (0.21) | 9.76 (0.22) | 9.82 (0.22) | 9.88 (0.23) | 9.86 (0.26) | 9.83 (0.23) |
| hs_hg_c_Log2 | 0.56 (1.21) | 0.73 (1.27) | 0.41 (1.35) | 0.12 (1.35) | -0.78 (1.73) | -1.11 (1.70) | -0.75 (1.41) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.85 (1.80) | -0.43 (0.64) | -0.45 (0.83) | -0.32 (0.81) | -0.26 (0.83) | -0.16 (0.88) | -0.27 (0.94) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.06 (1.34) | 3.90 (1.09) | 4.23 (1.24) | 4.32 (1.02) | 4.96 (1.66) | 5.28 (1.62) | 5.16 (1.23) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.54 (0.73) | 3.47 (0.72) | 3.80 (0.84) | 4.03 (0.79) | 3.32 (0.95) | 3.25 (0.87) | 3.69 (0.96) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | 0.48 (1.19) | 0.17 (2.28) | 0.64 (2.35) | 1.08 (1.77) | -1.46 (3.58) | -1.32 (3.29) | -0.92 (3.65) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | -0.62 (3.08) | 0.10 (3.10) | 0.13 (3.31) | 0.19 (2.90) | 0.65 (3.10) | -0.02 (3.42) | -0.25 (3.57) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.43 (3.35) | -5.26 (3.63) | -4.26 (3.77) | -3.68 (3.60) | -4.30 (3.95) | -5.20 (3.93) | -5.08 (3.90) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -0.58 (0.66) | -0.53 (0.85) | -1.01 (0.99) | -1.07 (0.88) | -2.03 (1.40) | -2.37 (1.22) | -2.68 (0.71) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.53 (0.44) | 0.55 (0.53) | 0.67 (0.49) | 0.64 (0.51) | 0.66 (0.63) | 0.55 (0.58) | 0.35 (0.50) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 1.67 (0.68) | 1.55 (0.72) | 1.18 (1.00) | 1.10 (1.14) | 0.80 (1.05) | 0.62 (1.17) | 0.40 (0.94) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -3.21 (3.68) | -2.67 (3.04) | -0.92 (3.92) | -1.65 (3.89) | -1.57 (3.80) | -1.58 (3.80) | 0.87 (4.50) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.69 (1.08) | 2.84 (1.11) | 2.43 (1.19) | 2.32 (1.14) | 2.36 (1.34) | 2.45 (1.24) | 2.21 (1.51) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.19 (1.08) | 5.59 (1.07) | 4.84 (0.97) | 4.82 (0.98) | 5.84 (1.00) | 6.02 (0.93) | 6.27 (0.97) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 3.99 (0.86) | 4.34 (0.88) | 4.26 (0.90) | 4.52 (0.90) | 4.85 (0.95) | 5.11 (1.05) | 5.22 (0.77) | 4.68 (1.02) |
This section analyzes the full set of diet and chemical variables. Penalized regression models (LASSO, ridge, and elastic net) are used to identify which variables are most important for predicting BMI Z-score.
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
"hs_as_c_Log2",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mn_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_tl_cdich_None",
"hs_dde_cadj_Log2",
"hs_ddt_cadj_Log2",
"hs_hcb_cadj_Log2",
"hs_pcb118_cadj_Log2",
"hs_pcb138_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_pcb180_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_dmdtp_cdich_None",
"hs_dmp_cadj_Log2",
"hs_dmtp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pbde47_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfna_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_pfunda_c_Log2",
"hs_bpa_cadj_Log2",
"hs_bupa_cadj_Log2",
"hs_etpa_cadj_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_trcs_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mecpp_cadj_Log2",
"hs_mehhp_cadj_Log2",
"hs_mehp_cadj_Log2",
"hs_meohp_cadj_Log2",
"hs_mep_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2",
"hs_ohminp_cadj_Log2",
"hs_oxominp_cadj_Log2",
"hs_cotinine_cdich_None",
"hs_globalexp2_None"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_beverages_Ter",
"hs_break_cer_Ter",
"hs_caff_drink_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_cereal_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_meat_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_total_yog_Ter"
)
all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_full))
all_diet <- exposome %>% dplyr::select(all_of(postnatal_diet))
all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)
diet_outcome_cov <- cbind(outcome_cov, all_diet)
interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
interested_data_corr <- select_if(interested_data, is.numeric)
cor_matrix <- cor(interested_data_corr, method = "pearson", use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.4)
covariates_columns <- colnames(covariate_data)
covariates_data <- interested_data %>% dplyr::select(all_of(covariates_columns))
y <- interested_data$hs_zbmi_who
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
train_data <- interested_data[trainIndex,]
test_data <- interested_data[-trainIndex,]
x_train_cov <- model.matrix(~ . -1, data = covariates_data[trainIndex,])
x_test_cov <- model.matrix(~ . -1, data = covariates_data[-trainIndex,])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who
fit_lasso_cov <- cv.glmnet(x_train_cov, y_train, alpha = 1, family = "gaussian")
plot(fit_lasso_cov$glmnet.fit, xvar = "lambda", main = "Lasso Coefficients Path")
best_lambda_lasso <- fit_lasso_cov$lambda.min
coef(fit_lasso_cov, s = best_lambda_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.413483
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_child_age_None .
lasso_predictions_cov <- predict(fit_lasso_cov, s = best_lambda_lasso, newx = x_test_cov)
mse_lasso_cov <- mean((y_test - lasso_predictions_cov)^2)
rmse_lasso_cov <- sqrt(mse_lasso_cov)
cat("Lasso Test MSE (Covariates only):", mse_lasso_cov, "\n")
## Lasso Test MSE (Covariates only): 1.319431
cat("Lasso Test RMSE (Covariates only):", rmse_lasso_cov, "\n")
## Lasso Test RMSE (Covariates only): 1.148665
fit_ridge_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0, family = "gaussian")
plot(fit_ridge_cov$glmnet.fit, xvar = "lambda", main = "Ridge Coefficients Path")
best_lambda_ridge <- fit_ridge_cov$lambda.min
coef(fit_ridge_cov, s = best_lambda_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.469753606
## e3_sex_Nonefemale -0.004014094
## e3_sex_Nonemale 0.004014005
## e3_yearbir_None2004 -0.045803833
## e3_yearbir_None2005 -0.006042782
## e3_yearbir_None2006 -0.004660190
## e3_yearbir_None2007 0.001372564
## e3_yearbir_None2008 0.020817988
## e3_yearbir_None2009 0.072947788
## hs_child_age_None -0.007171273
ridge_predictions_cov <- predict(fit_ridge_cov, s = best_lambda_ridge, newx = x_test_cov)
mse_ridge_cov <- mean((y_test - ridge_predictions_cov)^2)
rmse_ridge_cov <- sqrt(mse_ridge_cov)
cat("Ridge Test MSE (Covariates only):", mse_ridge_cov, "\n")
## Ridge Test MSE (Covariates only): 1.317463
cat("Ridge Test RMSE (Covariates only):", rmse_ridge_cov, "\n")
## Ridge Test RMSE (Covariates only): 1.147808
fit_enet_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0.5, family = "gaussian")
plot(fit_enet_cov$glmnet.fit, xvar = "lambda", main = "Elastic Net Coefficients Path")
best_lambda_enet <- fit_enet_cov$lambda.min
coef(fit_enet_cov, s = best_lambda_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.54361182
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.15055331
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 0.03548813
## e3_yearbir_None2009 0.11072642
## hs_child_age_None -0.01602837
enet_predictions_cov <- predict(fit_enet_cov, s = best_lambda_enet, newx = x_test_cov)
mse_enet_cov <- mean((y_test - enet_predictions_cov)^2)
rmse_enet_cov <- sqrt(mse_enet_cov)
cat("Elastic Net Test MSE (Covariates only):", mse_enet_cov, "\n")
## Elastic Net Test MSE (Covariates only): 1.318349
cat("Elastic Net Test RMSE (Covariates only):", rmse_enet_cov, "\n")
## Elastic Net Test RMSE (Covariates only): 1.148194
The chemical exposures are screened for the most predictive variables using the same penalized regression methods.
#LASSO train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)
x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]
x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]
x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 1, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.7797230131
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.0238815730
## hs_co_c_Log2 -0.0011670319
## hs_cs_c_Log2 0.0771865955
## hs_cu_c_Log2 0.6071183261
## hs_hg_c_Log2 -0.0075730086
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.0992489424
## hs_pb_c_Log2 -0.0056257448
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.0378984008
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.1721262187
## hs_pcb170_cadj_Log2 -0.0557570999
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.0186165147
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.0357794002
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.0019079468
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.1360824261
## hs_pfos_c_Log2 -0.0478302901
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.0008622765
## hs_prpa_cadj_Log2 0.0011728557
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.0373221816
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.0477304169
## hs_mnbp_cadj_Log2 -0.0036235331
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.231997
# RIDGE
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.469806e+00
## hs_as_c_Log2 6.590433e-03
## hs_cd_c_Log2 -4.093355e-02
## hs_co_c_Log2 -5.049922e-02
## hs_cs_c_Log2 1.230373e-01
## hs_cu_c_Log2 6.078479e-01
## hs_hg_c_Log2 -3.225520e-02
## hs_mn_c_Log2 -3.089195e-02
## hs_mo_c_Log2 -1.068154e-01
## hs_pb_c_Log2 -5.295956e-02
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -4.888006e-02
## hs_ddt_cadj_Log2 4.045085e-03
## hs_hcb_cadj_Log2 -1.857150e-02
## hs_pcb118_cadj_Log2 1.400112e-02
## hs_pcb138_cadj_Log2 -3.614513e-02
## hs_pcb153_cadj_Log2 -1.223407e-01
## hs_pcb170_cadj_Log2 -5.267521e-02
## hs_pcb180_cadj_Log2 -1.074695e-02
## hs_dep_cadj_Log2 -2.548881e-02
## hs_detp_cadj_Log2 8.051621e-03
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 -2.097690e-03
## hs_dmtp_cadj_Log2 7.300567e-05
## hs_pbde153_cadj_Log2 -3.315313e-02
## hs_pbde47_cadj_Log2 5.273953e-03
## hs_pfhxs_c_Log2 -2.966308e-02
## hs_pfna_c_Log2 2.336166e-02
## hs_pfoa_c_Log2 -1.519872e-01
## hs_pfos_c_Log2 -6.495855e-02
## hs_pfunda_c_Log2 1.248503e-02
## hs_bpa_cadj_Log2 3.832688e-04
## hs_bupa_cadj_Log2 6.588467e-03
## hs_etpa_cadj_Log2 -6.098679e-03
## hs_mepa_cadj_Log2 -1.638466e-02
## hs_oxbe_cadj_Log2 1.390524e-02
## hs_prpa_cadj_Log2 1.258510e-02
## hs_trcs_cadj_Log2 2.878805e-03
## hs_mbzp_cadj_Log2 5.550048e-02
## hs_mecpp_cadj_Log2 1.627174e-03
## hs_mehhp_cadj_Log2 2.316991e-02
## hs_mehp_cadj_Log2 -1.662304e-02
## hs_meohp_cadj_Log2 1.137436e-02
## hs_mep_cadj_Log2 3.371106e-03
## hs_mibp_cadj_Log2 -5.391219e-02
## hs_mnbp_cadj_Log2 -4.383016e-02
## hs_ohminp_cadj_Log2 -2.886768e-02
## hs_oxominp_cadj_Log2 2.204660e-02
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.188752
# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.785950188
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.025843356
## hs_co_c_Log2 -0.005835867
## hs_cs_c_Log2 0.084715330
## hs_cu_c_Log2 0.607379616
## hs_hg_c_Log2 -0.009800093
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.099724922
## hs_pb_c_Log2 -0.010318890
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.039528137
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.169008355
## hs_pcb170_cadj_Log2 -0.055808065
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.019034348
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.035464586
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.006816020
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.135997766
## hs_pfos_c_Log2 -0.047692264
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.002529961
## hs_prpa_cadj_Log2 0.001735800
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.040317847
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.047892677
## hs_mnbp_cadj_Log2 -0.008483913
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.228805
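Across the three chemicals-only models, ridge attains the lowest test error. A small base-R summary of the MSEs reported above makes the comparison explicit:

```r
# Test MSEs reported above for the chemicals-only models
test_mse <- c(lasso = 1.231997, ridge = 1.188752, enet = 1.228805)
sort(test_mse)
names(which.min(test_mse))  # "ridge"
```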
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
The chemical features above were selected based on the nonzero coefficients from the elastic net fit. Because LASSO can be overly aggressive in dropping correlated predictors when reducing dimensionality, elastic net was preferred given the uncertainty about which features would prove important.
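Rather than transcribing the selected features by hand from the printed coefficients, the nonzero entries can be pulled directly from a fitted cv.glmnet object. This is an illustrative sketch on simulated data (the `chem_*` variable names are invented); the same pattern applies to `fit_without_covariates_train` above.

```r
# Extract elastic-net-selected features programmatically: keep the
# predictors whose coefficient at lambda.min is nonzero.
library(glmnet)

set.seed(101)
x <- matrix(rnorm(200 * 10), 200, 10,
            dimnames = list(NULL, paste0("chem_", 1:10)))
y <- 0.5 * x[, "chem_1"] - 0.3 * x[, "chem_2"] + rnorm(200)

fit <- cv.glmnet(x, y, alpha = 0.5, family = "gaussian")

coefs <- coef(fit, s = "lambda.min")
selected <- setdiff(rownames(coefs)[which(as.vector(coefs) != 0)],
                    "(Intercept)")
selected
```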
As with the chemical variables, the children's postnatal diet variables will be screened for the most informative predictors using the same regularized regression methods.
# LASSO with train/test
set.seed(101)
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)
diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])
covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])
x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0
y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])
# fit models
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 1, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 1, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.06922 9 1.431 0.06022 5
## 1se 0.14570 1 1.442 0.06160 0
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.53256344
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.13588632
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] -0.02481964
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.05164312
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.01594403
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.08180563
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.34942
# RIDGE
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 3.53 41 1.431 0.08497 40
## 1se 145.70 1 1.441 0.08233 40
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.5163069457
## h_bfdur_Ter(0,10.8] -0.0114164662
## h_bfdur_Ter(10.8,34.9] 0.0353770607
## h_bfdur_Ter(34.9,Inf] -0.0138651651
## hs_bakery_prod_Ter(2,6] 0.0228606785
## hs_bakery_prod_Ter(6,Inf] -0.0268639952
## hs_beverages_Ter(0.132,1] -0.0065939314
## hs_beverages_Ter(1,Inf] -0.0016124215
## hs_break_cer_Ter(1.1,5.5] -0.0034207548
## hs_break_cer_Ter(5.5,Inf] -0.0337182186
## hs_caff_drink_Ter(0.132,Inf] -0.0143879393
## hs_dairy_Ter(14.6,25.6] 0.0355023507
## hs_dairy_Ter(25.6,Inf] -0.0005581647
## hs_fastfood_Ter(0.132,0.5] 0.0161761119
## hs_fastfood_Ter(0.5,Inf] -0.0001750742
## hs_org_food_Ter(0.132,1] 0.0151677373
## hs_org_food_Ter(1,Inf] -0.0682466785
## hs_proc_meat_Ter(1.5,4] 0.0222199344
## hs_proc_meat_Ter(4,Inf] -0.0187135643
## hs_readymade_Ter(0.132,0.5] -0.0013536008
## hs_readymade_Ter(0.5,Inf] 0.0105115509
## hs_total_bread_Ter(7,17.5] -0.0035702530
## hs_total_bread_Ter(17.5,Inf] -0.0070550360
## hs_total_cereal_Ter(14.1,23.6] 0.0082269928
## hs_total_cereal_Ter(23.6,Inf] -0.0131001584
## hs_total_fish_Ter(1.5,3] -0.0346609367
## hs_total_fish_Ter(3,Inf] -0.0051749487
## hs_total_fruits_Ter(7,14.1] 0.0266413533
## hs_total_fruits_Ter(14.1,Inf] -0.0389551124
## hs_total_lipids_Ter(3,7] -0.0022752284
## hs_total_lipids_Ter(7,Inf] -0.0476627593
## hs_total_meat_Ter(6,9] 0.0007524275
## hs_total_meat_Ter(9,Inf] 0.0005196923
## hs_total_potatoes_Ter(3,4] 0.0105526823
## hs_total_potatoes_Ter(4,Inf] 0.0048180175
## hs_total_sweets_Ter(4.1,8.5] -0.0392140671
## hs_total_sweets_Ter(8.5,Inf] -0.0010028529
## hs_total_veg_Ter(6,8.5] 0.0009962184
## hs_total_veg_Ter(8.5,Inf] -0.0556956882
## hs_total_yog_Ter(6,8.5] -0.0102351610
## hs_total_yog_Ter(8.5,Inf] -0.0089303177
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.326308
#ELASTIC NET
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.07218 16 1.430 0.05641 12
## 1se 0.29139 1 1.444 0.05877 0
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.650606526
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] 0.039832328
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] -0.052635590
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] -0.054788470
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] 0.053455833
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.185235916
## hs_proc_meat_Ter(1.5,4] 0.008558872
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] -0.057540803
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] 0.017171763
## hs_total_fruits_Ter(14.1,Inf] -0.054914989
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.094342286
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.089860153
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.118161721
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.335144
#selected diets that were noted in enet
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
Elastic net was again used for variable selection because it is unclear in advance which features are important, and it handles correlated predictors better than LASSO.
With the selected variables in hand, the analysis builds up from a covariates-only baseline and then adds the diet and chemical variables in turn. In the LASSO, ridge, and elastic net fits, the covariates are assigned a penalty factor of zero so they are never shrunk out of the model.
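The "freezing" works through glmnet's `penalty.factor` argument: a factor of 0 exempts a column from the penalty entirely, so those covariates are always retained, while the remaining predictors are regularized as usual. A minimal sketch on toy data (the column names here are invented):

```r
library(glmnet)

set.seed(101)
x <- matrix(rnorm(300 * 6), 300, 6,
            dimnames = list(NULL, c("age", "sex", "exp_1", "exp_2",
                                    "exp_3", "exp_4")))
y <- 0.2 * x[, "age"] + 0.4 * x[, "exp_1"] + rnorm(300)

# Penalty factor 0 = unpenalized (frozen); 1 = usual penalty
pf <- rep(1, ncol(x))
pf[colnames(x) %in% c("age", "sex")] <- 0

fit <- cv.glmnet(x, y, alpha = 1, family = "gaussian", penalty.factor = pf)
coef(fit, s = "lambda.min")  # "age" and "sex" are retained at every lambda
```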
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
#selected diets that were noted in enet
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
combined_data_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))
finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_matrix <- as.data.frame(as.table(cor_matrix))
cor_matrix <- na.omit(cor_matrix)
cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]
cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)
return(cor_matrix)
}
highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
set.seed(101)
# Splitting data into training and test sets
train_indices <- sample(seq_len(nrow(finalized_data)), size = floor(0.7 * nrow(finalized_data)))
test_indices <- setdiff(seq_len(nrow(finalized_data)), train_indices)
# Creating training and test datasets
train_data <- finalized_data[train_indices, ]
test_data <- finalized_data[test_indices, ]
# Separating predictors and outcome variable
x_train <- model.matrix(~ . + 0, data = train_data[ , !names(train_data) %in% "hs_zbmi_who"])
x_test <- model.matrix(~ . + 0, data = test_data[ , !names(test_data) %in% "hs_zbmi_who"])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
# to freeze the covariates so they are not shrunk
# note: %in% matches exact column names only, so dummy-coded factor levels
# (e.g. "e3_sex_Nonefemale") do not match "e3_sex_None" and remain penalized;
# startsWith() on the covariate prefixes would freeze those columns as well
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))
x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Baseline Lasso MSE:", mse_lasso, "\n")
## Baseline Lasso MSE: 1.327143
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.152017
set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.110211e-38
## e3_sex_Nonemale 3.110211e-38
## e3_yearbir_None2004 -1.463782e-37
## e3_yearbir_None2005 2.704907e-38
## e3_yearbir_None2006 -9.332751e-39
## e3_yearbir_None2007 -5.445405e-38
## e3_yearbir_None2008 3.129795e-38
## e3_yearbir_None2009 3.620840e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Baseline Ridge MSE:", mse_ridge, "\n")
## Baseline Ridge MSE: 1.327143
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.152017
set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Baseline Elastic Net MSE:", mse_enet, "\n")
## Baseline Elastic Net MSE: 1.327143
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.152017
set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_child_age_None
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011581 0 1.00000 1.0026 0.051032
## 2 0.010000 1 0.98842 1.0130 0.051854
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Baseline Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Baseline Pruned Decision Tree MSE: 1.319431
cat("Baseline Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Baseline Pruned Decision Tree RMSE: 1.148665
The pruned decision tree model has an RMSE of 1.148665 on the test data, meaning its predictions are off by about 1.15 BMI Z-score units on average. Since the tree used only one variable (hs_child_age_None), that variable appears to be the dominant predictor of hs_zbmi_who in this dataset. The performance metrics nonetheless leave room for improvement, potentially through more flexible models or additional relevant features.
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Baseline Random Forest MSE:", mse_rf, "\n")
## Baseline Random Forest MSE: 1.342326
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.158588
Importance plots show that age and year of birth are the most influential variables.
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Baseline GBM MSE:", mse_gbm, "\n")
## Baseline GBM MSE: 1.320142
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.148974
Model Comparison:
At baseline, Lasso, Ridge, and Elastic Net produce identical test MSE (1.327) and RMSE (1.152): all three effectively retain only child age, since Ridge shrinks the remaining coefficients to near zero (on the order of 1e-38) while Lasso and Elastic Net set them exactly to zero.
The pruned Decision Tree has a slightly lower MSE (1.319) and RMSE (1.149) than the regularized regressions, indicating marginally better performance.
The Random Forest model has the highest MSE (1.342) and RMSE (1.159) among the models, suggesting it may not be the best choice for this dataset.
The GBM model performs similarly to the Decision Tree, with an MSE of 1.320 and an RMSE of 1.149.
The differences are small, indicating that none of the models substantially outperforms the others on the baseline covariates alone.
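To make the comparison easier to scan, the reported baseline test metrics can be collected into a single table (values copied from the outputs above; base R only):

```r
# Baseline test metrics for each model, sorted from best (lowest MSE) to worst
baseline_results <- data.frame(
  Model = c("Lasso", "Ridge", "Elastic Net", "Pruned Tree",
            "Random Forest", "GBM"),
  MSE   = c(1.327143, 1.327143, 1.327143, 1.319431, 1.342326, 1.320142)
)
baseline_results$RMSE <- sqrt(baseline_results$MSE)
baseline_results[order(baseline_results$MSE), ]
```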
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Diet + Covariates Lasso MSE:", mse_lasso, "\n")
## Diet + Covariates Lasso MSE: 1.296766
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.138756
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.392957e-38
## e3_sex_Nonemale 3.392957e-38
## e3_yearbir_None2004 -1.596853e-37
## e3_yearbir_None2005 2.950807e-38
## e3_yearbir_None2006 -1.018118e-38
## e3_yearbir_None2007 -5.940442e-38
## e3_yearbir_None2008 3.414322e-38
## e3_yearbir_None2009 3.950007e-37
## h_bfdur_Ter(10.8,34.9] 1.959443e-37
## h_bfdur_Ter(34.9,Inf] -1.196150e-37
## hs_bakery_prod_Ter(2,6] 2.102760e-37
## hs_bakery_prod_Ter(6,Inf] -1.896677e-37
## hs_break_cer_Ter(1.1,5.5] 9.801144e-39
## hs_break_cer_Ter(5.5,Inf] -2.286856e-37
## hs_dairy_Ter(14.6,25.6] 5.634165e-38
## hs_dairy_Ter(25.6,Inf] 8.216707e-41
## hs_fastfood_Ter(0.132,0.5] 1.215262e-37
## hs_fastfood_Ter(0.5,Inf] -6.235230e-38
## hs_org_food_Ter(0.132,1] 8.146218e-38
## hs_org_food_Ter(1,Inf] -1.900266e-37
## hs_proc_meat_Ter(1.5,4] 1.193171e-37
## hs_proc_meat_Ter(4,Inf] -1.994408e-38
## hs_total_fish_Ter(1.5,3] -8.902720e-38
## hs_total_fish_Ter(3,Inf] 2.141752e-38
## hs_total_fruits_Ter(7,14.1] 2.168036e-37
## hs_total_fruits_Ter(14.1,Inf] -2.040600e-37
## hs_total_lipids_Ter(3,7] -9.170321e-39
## hs_total_lipids_Ter(7,Inf] -2.153479e-37
## hs_total_sweets_Ter(4.1,8.5] -7.677763e-38
## hs_total_sweets_Ter(8.5,Inf] -1.716694e-38
## hs_total_veg_Ter(6,8.5] 2.132182e-38
## hs_total_veg_Ter(8.5,Inf] -1.282604e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Diet + Covariates Ridge MSE:", mse_ridge, "\n")
## Diet + Covariates Ridge MSE: 1.288063
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.134929
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Diet + Covariates Elastic Net MSE:", mse_enet, "\n")
## Diet + Covariates Elastic Net MSE: 1.295066
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.13801
With the diet variables added, Ridge Regression achieves the lowest test MSE (1.288) of the three regularized models, continuing to perform well.
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_bakery_prod_Ter hs_break_cer_Ter hs_child_age_None
## [4] hs_total_lipids_Ter
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011773 0 1.00000 1.0033 0.050928
## 2 0.010484 3 0.96468 1.0157 0.051237
## 3 0.010000 4 0.95420 1.0275 0.053129
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.319431
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.148665
In the unpruned tree, the top split is on child age (roughly 6.5 years and above), followed by splits on bakery products, breakfast cereals, and total lipids.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 233.57368
## e3_sex_None 34.19375
## e3_yearbir_None 97.75201
## h_bfdur_Ter 65.45676
## hs_bakery_prod_Ter 64.03655
## hs_break_cer_Ter 65.99478
## hs_dairy_Ter 69.72797
## hs_fastfood_Ter 55.45883
## hs_org_food_Ter 65.89629
## hs_proc_meat_Ter 68.85280
## hs_total_fish_Ter 63.72114
## hs_total_fruits_Ter 64.85756
## hs_total_lipids_Ter 65.79002
## hs_total_sweets_Ter 66.74543
## hs_total_veg_Ter 69.89130
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Covariates Random Forest MSE: 1.299369
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.139899
The Random Forest model shows a slight improvement when diet variables are included. Age, year of birth, and vegetable intake carry the highest variable importance.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Covariates GBM MSE: 1.294869
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.137923
GBM shows a modest improvement with the diet variables (test MSE from 1.320 to 1.295), indicating it can make use of the additional predictors.
Model Comparison:
For the Lasso, Ridge, Elastic Net, and GBM models, adding the diet-related features on top of the covariates yielded a slight improvement in test MSE and RMSE, suggesting these features carry some predictive signal for the BMI Z-score.
The Decision Tree model showed almost no change in performance with the addition of diet-related features and other covariates.
The Ridge Regression model had the lowest MSE (1.288) and RMSE (1.135) with the diet and covariates features, indicating it is the best performing model among those compared.
Both Random Forest and GBM models showed a notable improvement with the inclusion of diet-related features and covariates. The Random Forest model’s RMSE decreased from 1.159 to 1.140, and the GBM model’s RMSE decreased from 1.149 to 1.138.
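The same tabulation for the diet + covariates models (values copied from the outputs above; base R only):

```r
# Diet + covariates test metrics, sorted from best (lowest MSE) to worst
diet_results <- data.frame(
  Model = c("Lasso", "Ridge", "Elastic Net", "Pruned Tree",
            "Random Forest", "GBM"),
  MSE   = c(1.296766, 1.288063, 1.295066, 1.319431, 1.299369, 1.294869)
)
diet_results$RMSE <- sqrt(diet_results$MSE)
diet_results[order(diet_results$MSE), ]
```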
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.201642945
## hs_child_age_None -0.042729967
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.040205226
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.011319810
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.260654501
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.069199903
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.041339092
## hs_pcb153_cadj_Log2 -0.125818759
## hs_pcb170_cadj_Log2 -0.043143964
## hs_dep_cadj_Log2 -0.003677983
## hs_pbde153_cadj_Log2 -0.036720000
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.097979410
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.007077421
## hs_mibp_cadj_Log2 -0.025837249
## hs_mnbp_cadj_Log2 -0.004654193
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Chemical + Covariates Lasso MSE:", mse_lasso, "\n")
## Chemical + Covariates Lasso MSE: 1.082759
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.040557
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.017064370
## hs_child_age_None -0.050364241
## e3_sex_Nonefemale -0.016862646
## e3_sex_Nonemale 0.016857713
## e3_yearbir_None2004 -0.088001228
## e3_yearbir_None2005 0.019239251
## e3_yearbir_None2006 0.031700788
## e3_yearbir_None2007 -0.021259447
## e3_yearbir_None2008 -0.006970817
## e3_yearbir_None2009 0.164971168
## hs_cd_c_Log2 -0.028107768
## hs_co_c_Log2 -0.021526809
## hs_cs_c_Log2 0.044750806
## hs_cu_c_Log2 0.241316062
## hs_hg_c_Log2 -0.009375312
## hs_mo_c_Log2 -0.052460551
## hs_pb_c_Log2 -0.025851868
## hs_dde_cadj_Log2 -0.038109739
## hs_pcb153_cadj_Log2 -0.092130426
## hs_pcb170_cadj_Log2 -0.028327506
## hs_dep_cadj_Log2 -0.008871966
## hs_pbde153_cadj_Log2 -0.021002524
## hs_pfhxs_c_Log2 -0.023072685
## hs_pfoa_c_Log2 -0.083415479
## hs_pfos_c_Log2 -0.024508537
## hs_prpa_cadj_Log2 0.001511971
## hs_mbzp_cadj_Log2 0.025157058
## hs_mibp_cadj_Log2 -0.024449722
## hs_mnbp_cadj_Log2 -0.030987317
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Chemical + Covariates Ridge MSE:", mse_ridge, "\n")
## Chemical + Covariates Ridge MSE: 1.072508
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.035619
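Ridge, by contrast, shrinks all coefficients jointly and never sets any exactly to zero, which is why its coefficient table above has no `.` entries. A self-contained sketch of the closed-form ridge estimator on simulated data (the lambda values are arbitrary):

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, 0, -1, 0, 0.5)
y <- X %*% beta_true + rnorm(n)

# Closed-form ridge estimate: (X'X + lambda I)^(-1) X'y
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_small <- ridge_beta(X, y, lambda = 0.1)
b_large <- ridge_beta(X, y, lambda = 100)

# A heavier penalty shrinks the norm of the coefficient vector...
sum(b_large^2) < sum(b_small^2)   # TRUE
# ...but no coefficient becomes exactly zero
any(b_large == 0)                 # FALSE
```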
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.667026013
## hs_child_age_None -0.039495873
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.001139167
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.189921157
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.050667701
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.030158657
## hs_pcb153_cadj_Log2 -0.119215474
## hs_pcb170_cadj_Log2 -0.038236462
## hs_dep_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.032755305
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.077859529
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.006808704
## hs_mnbp_cadj_Log2 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Chemical + Covariates Elastic Net MSE:", mse_enet, "\n")
## Chemical + Covariates Elastic Net MSE: 1.08233
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.040351
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_mbzp_cadj_Log2
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00330 0.050928
## 2 0.032769 1 0.92261 0.93423 0.048425
## 3 0.032579 2 0.88984 0.92796 0.047448
## 4 0.025008 3 0.85726 0.90155 0.045688
## 5 0.014062 4 0.83226 0.89664 0.046321
## 6 0.013674 6 0.80413 0.90037 0.048102
## 7 0.013602 7 0.79046 0.89910 0.048133
## 8 0.013517 8 0.77686 0.89738 0.047637
## 9 0.011314 9 0.76334 0.89439 0.046507
## 10 0.011047 11 0.74071 0.89261 0.046811
## 11 0.010329 12 0.72967 0.89860 0.047033
## 12 0.010000 14 0.70901 0.90899 0.047783
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.210403
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.100183
The pruned decision tree diagram illustrates how the model makes splits based on chemical exposure variables like hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and hs_mnbp_cadj_Log2.
The first split is on hs_pcb170_cadj_Log2 with a threshold of 1.2, indicating that this chemical exposure level is a critical factor in predicting BMI Z-score.
Subsequent splits further refine the predictions based on other chemical exposures, highlighting the importance of these factors in the model.
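The pruning above selects the cp with the minimum cross-validated `xerror`. A common, more conservative alternative is the 1-SE rule: take the simplest tree whose `xerror` is within one standard error of the minimum. A sketch on a toy `cptable` with the same columns `rpart` stores in `fit$cptable` (the numbers are illustrative, not from this fit):

```r
# Toy cptable mimicking the structure of fit$cptable
cptable <- cbind(
  CP     = c(0.080, 0.030, 0.020, 0.011, 0.010),
  nsplit = c(0, 1, 3, 5, 7),
  xerror = c(1.000, 0.930, 0.900, 0.895, 0.910),
  xstd   = c(0.050, 0.050, 0.050, 0.050, 0.050)
)

# Minimum-xerror rule (as used in the analysis)
cp_min <- cptable[which.min(cptable[, "xerror"]), "CP"]

# 1-SE rule: simplest tree with xerror within one SE of the minimum
i_min <- which.min(cptable[, "xerror"])
threshold <- cptable[i_min, "xerror"] + cptable[i_min, "xstd"]
cp_1se <- cptable[min(which(cptable[, "xerror"] <= threshold)), "CP"]

c(cp_min = cp_min, cp_1se = cp_1se)
# cp_min = 0.011, cp_1se = 0.030
```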
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 47.187198
## e3_sex_None 5.878084
## e3_yearbir_None 35.112485
## hs_cd_c_Log2 51.795652
## hs_co_c_Log2 51.717642
## hs_cs_c_Log2 40.518806
## hs_cu_c_Log2 68.458209
## hs_hg_c_Log2 51.566383
## hs_mo_c_Log2 60.691194
## hs_pb_c_Log2 48.070852
## hs_dde_cadj_Log2 78.539452
## hs_pcb153_cadj_Log2 85.795111
## hs_pcb170_cadj_Log2 132.855527
## hs_dep_cadj_Log2 55.605309
## hs_pbde153_cadj_Log2 98.811874
## hs_pfhxs_c_Log2 45.792625
## hs_pfoa_c_Log2 61.349673
## hs_pfos_c_Log2 51.462842
## hs_prpa_cadj_Log2 51.557275
## hs_mbzp_cadj_Log2 57.029739
## hs_mibp_cadj_Log2 45.097714
## hs_mnbp_cadj_Log2 52.899049
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Chemical + Covariates Random Forest MSE: 1.062342
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.0307
The features such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 are among the top predictors, indicating that these chemical exposures are significant contributors to the model’s predictions.
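`IncNodePurity` above is the total decrease in residual sum of squares attributable to splits on each variable. A model-agnostic cross-check is permutation importance: shuffle one predictor and measure how much prediction error rises. A self-contained base-R sketch on simulated data (the linear model and variable names are generic stand-ins, not the HELIX fit):

```r
set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + 0.5 * dat$x2 + rnorm(n)   # x3 carries no signal

fit <- lm(y ~ x1 + x2 + x3, data = dat)         # stand-in for any fitted model
base_mse <- mean((dat$y - predict(fit, dat))^2)

# Importance = increase in MSE after shuffling one predictor
perm_importance <- function(fit, dat, var) {
  shuffled <- dat
  shuffled[[var]] <- sample(shuffled[[var]])
  mean((dat$y - predict(fit, shuffled))^2) - base_mse
}

imp <- sapply(c("x1", "x2", "x3"), function(v) perm_importance(fit, dat, v))
imp   # x1 dominates; x3 is near zero
```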
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Chemical + Covariates GBM MSE: 1.150933
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.072815
Including the chemical exposures alongside the covariates reduces the GBM's error relative to a covariates-only baseline, indicating that these features carry predictive signal. Within this feature set, however, the boosted model (MSE 1.151, RMSE 1.073) trails the penalized regressions and the random forest above, likely because 100 trees with a shrinkage of 0.01 underfits; a larger n.trees or a tuned learning rate would be needed for a fair comparison.
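Mechanically, a GBM builds its prediction additively: each new tree is fit to the current residuals and added with a small learning rate (the `shrinkage` argument above). A minimal base-R sketch of that loop using single-split stumps on simulated data (this illustrates the mechanism only, not `gbm`'s actual implementation):

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Fit one depth-1 "stump" to the residuals: best single split on x
fit_stump <- function(x, r) {
  best <- NULL; best_sse <- Inf
  for (cut in quantile(x, probs = seq(0.1, 0.9, 0.1))) {
    left <- x <= cut
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

shrinkage <- 0.1                 # plays the role of gbm's shrinkage
pred <- rep(mean(y), n)
for (m in 1:100) {               # plays the role of n.trees
  r <- y - pred                  # residuals = negative gradient of squared error
  s <- fit_stump(x, r)
  pred <- pred + shrinkage * ifelse(x <= s$cut, s$left, s$right)
}

mse_boost <- mean((y - pred)^2)  # well below the variance of y
```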
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
"h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
"hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
"hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)
chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))
set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
# make the group_indices vector
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals) # Group 3: Chemicals
)
# adjust length if needed
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # should be TRUE
## [1] TRUE
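The length check passes here, but it is worth seeing why the adjustment is needed at all: `model.matrix()` expands every factor into dummy columns (when the intercept is dropped, the first factor keeps all of its levels), so the design matrix can have more columns than the original variable list. A toy illustration:

```r
# Two numeric predictors plus one 3-level factor
toy <- data.frame(
  age  = c(7, 8, 9, 10),
  dose = c(1.2, 0.8, 1.5, 1.1),
  diet = factor(c("low", "mid", "high", "mid"))
)

# Without an intercept, the lone factor expands into one dummy per level
x <- model.matrix(~ . - 1, toy)
ncol(x)   # 5: age, dose, and three diet dummies

# Dummies derived from the same factor should share one group index
groups <- c(1, 2, rep(3, nlevels(toy$diet)))
length(groups) == ncol(x)   # TRUE
```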
set.seed(101)
# note: grpreg has no glmnet-style `penalty.factor` argument (per-group penalty weights are set via `group.multiplier`), so this argument passes through unused
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso MSE:", mse_group_lasso, "\n")
## Group Lasso MSE: 1.086867
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.042529
set.seed(101)
# "grMCP" is grpreg's group MCP penalty, not ridge regression
group_mcp_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP MSE:", mse_group_mcp, "\n")
## Group MCP MSE: 1.085814
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 1.042024
set.seed(101)
# "grSCAD" is grpreg's group SCAD penalty, not an elastic net
group_scad_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_scad_predictions <- predict(group_scad_model, x_test, type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD MSE:", mse_group_scad, "\n")
## Group SCAD MSE: 1.088671
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 1.043394
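What distinguishes these group penalties from the earlier elementwise ones is that they shrink a whole block of coefficients by its Euclidean norm, so a group (for example, all diet tertile dummies) is either retained jointly or zeroed out together. A minimal sketch of the groupwise soft-thresholding update, with made-up coefficient blocks:

```r
# Groupwise soft-thresholding: scale each block by its Euclidean norm
group_soft_threshold <- function(beta, lambda) {
  nrm <- sqrt(sum(beta^2))
  if (nrm <= lambda) return(rep(0, length(beta)))
  (1 - lambda / nrm) * beta
}

small_group <- c(0.03, -0.04)   # norm 0.05: dropped as a block
large_group <- c(0.60, -0.80)   # norm 1.00: jointly shrunk

group_soft_threshold(small_group, lambda = 0.1)
# 0 0
group_soft_threshold(large_group, lambda = 0.1)
# approximately 0.54 -0.72
```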
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_fastfood_Ter
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00089 0.050861
## 2 0.032769 1 0.92261 0.93652 0.048377
## 3 0.032579 2 0.88984 0.91682 0.046750
## 4 0.025008 3 0.85726 0.91218 0.046570
## 5 0.015671 4 0.83226 0.90443 0.045712
## 6 0.013674 5 0.81658 0.91485 0.047377
## 7 0.013602 6 0.80291 0.91478 0.047609
## 8 0.013517 7 0.78931 0.91478 0.047609
## 9 0.011314 8 0.77579 0.93397 0.049086
## 10 0.011047 10 0.75317 0.96270 0.050245
## 11 0.010296 11 0.74212 0.98215 0.050293
## 12 0.010000 12 0.73182 0.98607 0.050369
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.187546
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.089746
The pruned decision tree diagram illustrates how the model splits based on chemical exposure variables like hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and diet-related variables.
Key splits in the tree emphasize the importance of these exposures and dietary factors in predicting BMI Z-score, with the first split on hs_pcb170_cadj_Log2 at 1.2 being particularly important.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 41.267546
## e3_sex_None 5.540866
## e3_yearbir_None 31.564485
## h_bfdur_Ter 13.136212
## hs_bakery_prod_Ter 20.689578
## hs_break_cer_Ter 13.679182
## hs_dairy_Ter 11.181729
## hs_fastfood_Ter 11.895794
## hs_org_food_Ter 10.856862
## hs_proc_meat_Ter 10.827847
## hs_total_fish_Ter 10.984827
## hs_total_fruits_Ter 13.977047
## hs_total_lipids_Ter 12.980031
## hs_total_sweets_Ter 10.749403
## hs_total_veg_Ter 12.258026
## hs_cd_c_Log2 44.886048
## hs_co_c_Log2 45.054544
## hs_cs_c_Log2 34.878871
## hs_cu_c_Log2 60.387604
## hs_hg_c_Log2 43.835480
## hs_mo_c_Log2 50.597285
## hs_pb_c_Log2 39.126962
## hs_dde_cadj_Log2 71.115165
## hs_pcb153_cadj_Log2 74.294671
## hs_pcb170_cadj_Log2 130.998985
## hs_dep_cadj_Log2 50.333164
## hs_pbde153_cadj_Log2 92.865989
## hs_pfhxs_c_Log2 38.593526
## hs_pfoa_c_Log2 53.392093
## hs_pfos_c_Log2 44.360304
## hs_prpa_cadj_Log2 43.848750
## hs_mbzp_cadj_Log2 49.661827
## hs_mibp_cadj_Log2 38.399318
## hs_mnbp_cadj_Log2 45.955350
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Chemical + Covariates Random Forest MSE: 1.050314
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.024848
Features such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and hs_mnbp_cadj_Log2 are among the top predictors, indicating significant contributions from chemical exposures.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Covariates GBM MSE: 1.15011
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.072432
Adding the diet tertiles leaves the GBM's test error essentially unchanged (RMSE 1.0724 here vs. 1.0728 with chemicals and covariates alone), suggesting that these coarsely coded dietary features add little predictive signal beyond the chemical exposures and covariates.
For the metabolomic data, observations with missing values were excluded rather than imputed, since too many entries were missing for mean or median imputation to be reliable.
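The trade-off behind that decision can be seen on a toy data frame: mean imputation keeps every row but replaces a large share of values with a single constant (deflating variance and attenuating associations), while complete-case deletion keeps only fully observed rows. (Toy data, not HELIX.)

```r
toy <- data.frame(
  id    = 1:6,
  metab = c(1.2, NA, 0.8, NA, NA, 1.0)   # 50% missing
)

# Complete-case analysis: rows with any NA are dropped
nrow(na.omit(toy))   # 3

# Mean imputation: all rows kept, but half the values become one constant
imputed <- toy
imputed$metab[is.na(imputed$metab)] <- mean(toy$metab, na.rm = TRUE)
nrow(imputed)        # 6
var(imputed$metab) < var(toy$metab, na.rm = TRUE)   # TRUE: variance deflated
```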
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))
# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]
# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
| ID | metab_1 | metab_2 | metab_3 | metab_4 | metab_5 | … | metab_177 |
|---|---|---|---|---|---|---|---|
| 430 | -2.15 | -0.71 | 8.60 | 0.55 | 7.05 | … | 8.36 |
| 1187 | -0.69 | -0.37 | 9.15 | -1.33 | 6.89 | … | 7.91 |
| 940 | -0.69 | -0.36 | 8.95 | -0.13 | 7.10 | … | 8.12 |
| 936 | -0.19 | -0.34 | 8.54 | -0.62 | 7.01 | … | 7.98 |
| 788 | -1.96 | -0.35 | 8.73 | -0.80 | 6.90 | … | 8.04 |
| 698 | -1.90 | -0.63 | 8.24 | -0.46 | 6.94 | … | 7.78 |

(Output truncated for display: the full table contains all 177 metabolite columns.)
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))
#specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
selected_id_data <- cbind(outcome_and_cov, extracted_exposome)
# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)
selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
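Note that `merge(..., all = TRUE)` is a full outer join: children present in only one of the two datasets receive `NA`s in the other dataset's columns, and the subsequent `na.omit()` removes them. An inner join (`all = FALSE`, the default) would reach the same complete-case set in one step, assuming no other missingness. A toy illustration:

```r
a <- data.frame(ID = c(1, 2, 3), exposure = c(0.5, 0.7, 0.9))
b <- data.frame(ID = c(2, 3, 4), metab = c(1.1, 1.3, 1.5))

outer_join <- merge(a, b, by = "ID", all = TRUE)   # keeps all 4 IDs, with NAs
inner_join <- merge(a, b, by = "ID")               # keeps only IDs 2 and 3

nrow(outer_join)            # 4
nrow(na.omit(outer_join))   # 2
identical(na.omit(outer_join)$ID, inner_join$ID)   # TRUE
```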
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(postnatal_diet)       # this dataset was built from the 13-item postnatal_diet list, not diet_selected
num_chemicals <- length(chemicals_specific)
num_metabolomics <- ncol(metabol_serum_transposed) - 1 # subtract for the ID column
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals), # Group 3: Chemicals
rep(4, num_metabolomics) # Group 4: Metabolomics
)
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # This should be TRUE
## [1] TRUE
# Fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso Test MSE:", mse_group_lasso, "\n")
## Group Lasso Test MSE: 0.8713226
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9334466
set.seed(101)
# "grMCP" is grpreg's group MCP penalty, not ridge regression
group_mcp_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP MSE:", mse_group_mcp, "\n")
## Group MCP MSE: 0.8883061
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 0.9424999
set.seed(101)
# "grSCAD" is grpreg's group SCAD penalty, not an elastic net
group_scad_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_scad_predictions <- predict(group_scad_model, x_test, type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD MSE:", mse_group_scad, "\n")
## Group SCAD MSE: 0.8912767
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 0.9440745
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7,
list = FALSE,
times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_test <- test_data$hs_zbmi_who
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_pcb170_cadj_Log2
## [4] metab_100 metab_129 metab_140
## [7] metab_141 metab_142 metab_157
## [10] metab_161 metab_176 metab_47
## [13] metab_58 metab_8 metab_94
## [16] metab_95
##
## Root node error: 1254/841 = 1.4911
##
## n= 841
##
## CP nsplit rel error xerror xstd
## 1 0.089116 0 1.00000 1.00275 0.052776
## 2 0.069078 1 0.91088 0.94590 0.050339
## 3 0.037249 2 0.84181 0.91981 0.049729
## 4 0.034324 3 0.80456 0.90111 0.048711
## 5 0.025576 4 0.77023 0.89055 0.049604
## 6 0.021936 5 0.74466 0.90166 0.048993
## 7 0.021276 6 0.72272 0.90895 0.049515
## 8 0.020244 8 0.68017 0.90807 0.050310
## 9 0.016938 9 0.65992 0.91277 0.050356
## 10 0.015553 10 0.64299 0.90306 0.050272
## 11 0.015178 11 0.62743 0.90122 0.050368
## 12 0.014756 12 0.61226 0.90066 0.050300
## 13 0.014022 13 0.59750 0.91203 0.051412
## 14 0.013859 14 0.58348 0.92227 0.051732
## 15 0.012010 15 0.56962 0.93488 0.053622
## 16 0.011088 16 0.55761 0.93327 0.053609
## 17 0.011027 17 0.54652 0.92988 0.053560
## 18 0.010000 18 0.53549 0.93467 0.053891
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.272858
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.12821
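Pruning above uses the CP that minimizes the cross-validated error (`xerror`). A common, slightly more conservative alternative is the 1-SE rule: keep the simplest tree whose `xerror` lies within one standard error of the minimum. A self-contained sketch on synthetic data:

```r
# Sketch of the 1-SE pruning rule with rpart (synthetic data): take the
# largest CP whose cross-validated error is within one standard error of
# the minimum, giving a smaller tree that often generalizes as well.
library(rpart)

set.seed(101)
d <- data.frame(x = rnorm(300))
d$y <- ifelse(d$x > 0, 1, -1) + rnorm(300, sd = 0.5)

fit <- rpart(y ~ x, data = d, method = "anova")
cp_tab  <- fit$cptable
min_row <- which.min(cp_tab[, "xerror"])
thresh  <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
cp_1se  <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]
pruned_1se <- prune(fit, cp = cp_1se)
```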
set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
## IncNodePurity
## hs_child_age_None 4.7927709
## e3_sex_None 0.5707810
## e3_yearbir_None 5.3687139
## hs_cd_c_Log2 4.8057550
## hs_co_c_Log2 5.5171622
## hs_cs_c_Log2 5.0356667
## hs_cu_c_Log2 12.4603237
## hs_hg_c_Log2 6.5609155
## hs_mo_c_Log2 9.6711331
## hs_pb_c_Log2 6.0711093
## hs_dde_cadj_Log2 13.2357957
## hs_pcb153_cadj_Log2 42.5218764
## hs_pcb170_cadj_Log2 78.6536355
## hs_dep_cadj_Log2 5.7451267
## hs_pbde153_cadj_Log2 28.0252092
## hs_pfhxs_c_Log2 5.4200119
## hs_pfoa_c_Log2 9.0245854
## hs_pfos_c_Log2 6.9947002
## hs_prpa_cadj_Log2 5.0401167
## hs_mbzp_cadj_Log2 4.4169868
## hs_mibp_cadj_Log2 4.0028339
## hs_mnbp_cadj_Log2 5.0996136
## h_bfdur_Ter 2.8968599
## hs_bakery_prod_Ter 2.9013737
## hs_dairy_Ter 1.2216138
## hs_fastfood_Ter 0.9580573
## hs_org_food_Ter 1.1570809
## hs_readymade_Ter 1.4572844
## hs_total_bread_Ter 1.1592172
## hs_total_fish_Ter 1.4961459
## hs_total_fruits_Ter 1.1767595
## hs_total_lipids_Ter 1.3061318
## hs_total_potatoes_Ter 1.2547008
## hs_total_sweets_Ter 0.9540810
## hs_total_veg_Ter 1.0142937
## metab_1 4.5011841
## metab_2 4.1531041
## metab_3 3.4379265
## metab_4 5.7831834
## metab_5 2.9178829
## metab_6 9.2084145
## metab_7 4.5028527
## metab_8 36.1613265
## metab_9 2.8921966
## metab_10 3.3253911
## metab_11 4.0194203
## metab_12 3.3756205
## metab_13 4.1227190
## metab_14 5.2837840
## metab_15 4.4314031
## metab_16 2.6462109
## metab_17 2.4025892
## metab_18 3.0563512
## metab_19 2.2951339
## metab_20 3.9067077
## metab_21 2.8607845
## metab_22 2.6869075
## metab_23 2.8972758
## metab_24 3.7219939
## metab_25 3.4797174
## metab_26 7.4516177
## metab_27 3.3263961
## metab_28 3.6858414
## metab_29 3.3029111
## metab_30 18.5312981
## metab_31 3.4436727
## metab_32 3.0545040
## metab_33 5.2038360
## metab_34 2.4118531
## metab_35 7.7143364
## metab_36 3.8489845
## metab_37 3.5293186
## metab_38 2.5404708
## metab_39 2.5165610
## metab_40 5.1546331
## metab_41 3.8921235
## metab_42 6.1284140
## metab_43 3.4008043
## metab_44 3.5698540
## metab_45 3.6758543
## metab_46 5.1802108
## metab_47 6.2120208
## metab_48 11.7806465
## metab_49 33.0932505
## metab_50 9.9343588
## metab_51 6.2231737
## metab_52 3.3042007
## metab_53 5.1286048
## metab_54 4.9482602
## metab_55 7.9980726
## metab_56 4.6904911
## metab_57 4.7919533
## metab_58 3.2769554
## metab_59 5.7319035
## metab_60 3.9192335
## metab_61 3.0283668
## metab_62 4.3267745
## metab_63 4.8698592
## metab_64 3.9307501
## metab_65 3.8785658
## metab_66 2.5032525
## metab_67 2.7330496
## metab_68 3.5029673
## metab_69 2.4526201
## metab_70 3.2796735
## metab_71 4.6222410
## metab_72 3.4044271
## metab_73 3.0460977
## metab_74 2.7018733
## metab_75 3.8889307
## metab_76 2.8113720
## metab_77 4.7832198
## metab_78 4.9096271
## metab_79 4.0912870
## metab_80 3.5334220
## metab_81 3.3897748
## metab_82 4.6950398
## metab_83 3.3638699
## metab_84 3.0885189
## metab_85 4.6423208
## metab_86 3.2104952
## metab_87 3.8282082
## metab_88 2.8852207
## metab_89 2.7940235
## metab_90 2.6775937
## metab_91 3.2999703
## metab_92 3.4644294
## metab_93 2.8208312
## metab_94 8.2011811
## metab_95 51.6744901
## metab_96 8.1118854
## metab_97 3.0816526
## metab_98 3.2655000
## metab_99 4.4506345
## metab_100 3.9391564
## metab_101 3.6374127
## metab_102 5.7070910
## metab_103 4.1270357
## metab_104 5.6319496
## metab_105 3.3000728
## metab_106 3.4540736
## metab_107 3.8068032
## metab_108 4.0689264
## metab_109 5.3846332
## metab_110 7.8695016
## metab_111 2.7218197
## metab_112 2.7613608
## metab_113 5.5357192
## metab_114 4.3740721
## metab_115 4.3476546
## metab_116 6.1735729
## metab_117 7.0544259
## metab_118 2.4601605
## metab_119 5.4450427
## metab_120 7.1242577
## metab_121 3.7051871
## metab_122 6.2282665
## metab_123 3.2628782
## metab_124 3.3382452
## metab_125 3.4201616
## metab_126 2.4057902
## metab_127 5.0861475
## metab_128 6.4178268
## metab_129 4.2203870
## metab_130 3.4512521
## metab_131 3.2548319
## metab_132 3.2654363
## metab_133 2.6708217
## metab_134 3.2328840
## metab_135 5.2587159
## metab_136 5.0406508
## metab_137 6.3400908
## metab_138 4.7455979
## metab_139 3.7303384
## metab_140 2.7325221
## metab_141 7.9353350
## metab_142 13.5320442
## metab_143 9.6865623
## metab_144 3.9033383
## metab_145 4.2785181
## metab_146 4.2770667
## metab_147 3.5426166
## metab_148 3.3025573
## metab_149 5.6260285
## metab_150 4.9789354
## metab_151 3.5939464
## metab_152 4.7580724
## metab_153 4.3991465
## metab_154 5.0953881
## metab_155 2.5069093
## metab_156 2.8819060
## metab_157 4.3767383
## metab_158 3.9225685
## metab_159 3.3643529
## metab_160 7.0114089
## metab_161 28.1188870
## metab_162 3.9630717
## metab_163 14.4399347
## metab_164 7.1747718
## metab_165 3.9310471
## metab_166 3.8429719
## metab_167 3.3589423
## metab_168 3.1719266
## metab_169 4.2720474
## metab_170 4.2149550
## metab_171 4.9419993
## metab_172 3.7296383
## metab_173 4.0956131
## metab_174 4.6193564
## metab_175 3.7661891
## metab_176 6.2453467
## metab_177 14.2757880
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(rf_model, cex = 0.6)
rf_predictions <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)
cat(" Diet + Chemical + Metabolomic + Covariates Random Forest MSE:", rf_mse, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest MSE: 1.00938
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.004679
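With `ntree = 500` fixed above, one cheap diagnostic is the forest's out-of-bag (OOB) error: for regression, `randomForest` stores the OOB MSE after each successive tree in the fitted object's `$mse` component, and a flat tail indicates that adding trees would not help. Illustrated on synthetic data (the same plot applies to `rf_model`):

```r
# Sketch: out-of-bag (OOB) error vs. number of trees (synthetic data).
library(randomForest)

set.seed(101)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- d$x1 + 0.5 * d$x2 + rnorm(300)

rf <- randomForest(y ~ ., data = d, ntree = 300)
plot(rf$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")  # flat tail => enough trees
```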
set.seed(101)  # cv.folds makes the fold assignment stochastic
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
n.minobsinnode = 10,
shrinkage = 0.01,
cv.folds = 5,
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.4854 nan 0.0100 0.0031
## 2 1.4805 nan 0.0100 0.0028
## 3 1.4757 nan 0.0100 0.0018
## 4 1.4713 nan 0.0100 0.0028
## 5 1.4668 nan 0.0100 0.0017
## 6 1.4626 nan 0.0100 0.0020
## 7 1.4581 nan 0.0100 0.0040
## 8 1.4534 nan 0.0100 0.0032
## 9 1.4483 nan 0.0100 0.0038
## 10 1.4446 nan 0.0100 0.0029
## 20 1.3979 nan 0.0100 0.0035
## 40 1.3186 nan 0.0100 0.0021
## 60 1.2523 nan 0.0100 0.0018
## 80 1.1946 nan 0.0100 0.0013
## 100 1.1448 nan 0.0100 0.0007
## 120 1.0983 nan 0.0100 0.0003
## 140 1.0562 nan 0.0100 0.0006
## 160 1.0160 nan 0.0100 0.0006
## 180 0.9785 nan 0.0100 0.0004
## 200 0.9443 nan 0.0100 0.0010
## 220 0.9120 nan 0.0100 0.0005
## 240 0.8802 nan 0.0100 0.0006
## 260 0.8527 nan 0.0100 -0.0002
## 280 0.8277 nan 0.0100 0.0009
## 300 0.8038 nan 0.0100 0.0001
## 320 0.7802 nan 0.0100 0.0003
## 340 0.7576 nan 0.0100 0.0004
## 360 0.7374 nan 0.0100 0.0004
## 380 0.7185 nan 0.0100 0.0003
## 400 0.7012 nan 0.0100 -0.0001
## 420 0.6842 nan 0.0100 0.0001
## 440 0.6683 nan 0.0100 0.0001
## 460 0.6529 nan 0.0100 0.0000
## 480 0.6372 nan 0.0100 -0.0002
## 500 0.6226 nan 0.0100 0.0000
## 520 0.6085 nan 0.0100 0.0002
## 540 0.5954 nan 0.0100 0.0003
## 560 0.5827 nan 0.0100 -0.0002
## 580 0.5703 nan 0.0100 -0.0003
## 600 0.5584 nan 0.0100 -0.0002
## 620 0.5477 nan 0.0100 -0.0001
## 640 0.5368 nan 0.0100 0.0001
## 660 0.5267 nan 0.0100 -0.0000
## 680 0.5159 nan 0.0100 0.0001
## 700 0.5062 nan 0.0100 0.0001
## 720 0.4966 nan 0.0100 -0.0000
## 740 0.4872 nan 0.0100 -0.0002
## 760 0.4786 nan 0.0100 -0.0003
## 780 0.4704 nan 0.0100 -0.0000
## 800 0.4618 nan 0.0100 -0.0001
## 820 0.4538 nan 0.0100 -0.0000
## 840 0.4458 nan 0.0100 -0.0001
## 860 0.4381 nan 0.0100 -0.0001
## 880 0.4307 nan 0.0100 -0.0001
## 900 0.4237 nan 0.0100 -0.0001
## 920 0.4164 nan 0.0100 -0.0002
## 940 0.4097 nan 0.0100 -0.0002
## 960 0.4032 nan 0.0100 -0.0001
## 980 0.3966 nan 0.0100 -0.0000
## 1000 0.3905 nan 0.0100 -0.0000
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")
predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Metabolomic + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM MSE: 0.8857186
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.9411262
Including the diet, chemical, and metabolomic variables alongside the covariates yields the largest gains in predictive performance, indicating that these additional predictors carry substantial signal for BMI Z-score.
Among the advanced models, the GBM achieves the best test-set performance, reflecting its ability to exploit the full set of predictors.
results <- data.frame(
Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
Data_Set = rep(c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"), 6),
MSE = c(1.327, 1.297, 1.083, 1.087, 0.871,
1.327, 1.288, 1.072, 1.086, 0.888,
1.327, 1.295, 1.082, 1.089, 0.891,
1.319, 1.319, 1.210, 1.187, 1.273,
1.342, 1.299, 1.062, 1.050, 1.009,
1.320, 1.295, 1.151, 1.150, 0.886),
RMSE = c(1.152, 1.139, 1.041, 1.042, 0.933,
1.152, 1.135, 1.036, 1.042, 0.942,
1.152, 1.138, 1.040, 1.043, 0.944,
1.149, 1.148, 1.100, 1.090, 1.128,
1.159, 1.140, 1.031, 1.025, 1.005,
1.149, 1.138, 1.073, 1.072, 0.941)
)
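A table this wide is easier to read once summarized, for example by pulling out the model with the lowest MSE for each predictor set. Shown on a toy two-model table so the snippet stands alone; the same call applies to the full `results` data frame above:

```r
# Base-R sketch: best model (lowest MSE) per predictor set. Toy table only;
# substitute the `results` data frame defined above for `toy`.
toy <- data.frame(
  Model    = c("Lasso", "GBM", "Lasso", "GBM"),
  Data_Set = c("A", "A", "B", "B"),
  MSE      = c(1.20, 0.90, 1.10, 1.30)
)
best <- do.call(rbind, lapply(split(toy, toy$Data_Set),
                              function(d) d[which.min(d$MSE), ]))
best[, c("Data_Set", "Model", "MSE")]
```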
mse_plot <- ggplot(results, aes(x = Data_Set, y = MSE, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_plot)
print(rmse_plot)
# filter results for Lasso model
lasso_results <- subset(results, Model == "Lasso")
mse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Lasso Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Lasso Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_lasso_plot)
print(rmse_lasso_plot)
# filter results for ridge model
ridge_results <- subset(results, Model == "Ridge")
mse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Ridge Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Ridge Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_ridge_plot)
print(rmse_ridge_plot)
# filter results for elastic net model
enet_results <- subset(results, Model == "Elastic Net")
mse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Elastic Net Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Elastic Net Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_enet_plot)
print(rmse_enet_plot)
# filter results for decision tree model
tree_results <- subset(results, Model == "Decision Tree")
mse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Decision Tree Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Decision Tree Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_tree_plot)
print(rmse_tree_plot)
# filter results for random forest model
rf_results <- subset(results, Model == "Random Forest")
mse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Random Forest Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Random Forest Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_rf_plot)
print(rmse_rf_plot)
# filter results for GBM model
gbm_results <- subset(results, Model == "GBM")
mse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "GBM Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "GBM Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_gbm_plot)
print(rmse_gbm_plot)